Abstract

We propose a simple, scalable, and fast gradient descent algorithm to optimize a nonconvex
objective for the rank minimization problem and a closely related family of semidefinite
programs. With $O(r^3 \kappa^2 n \log n)$ random measurements of a positive semidefinite $n \times n$ matrix
of rank $r$ and condition number $\kappa$, our method is guaranteed to converge linearly to the global
optimum.


1 Introduction 

Semidefinite programming has become a key optimization tool in many areas of applied
mathematics, signal processing and machine learning. SDPs often arise naturally from the problem
structure, or are derived as surrogate optimizations that are relaxations of difficult combinatorial 
problems [7, 1, 8]. In spite of the importance of SDPs in principle—promising efficient algorithms 
with polynomial runtime guarantees—it is widely recognized that current optimization algorithms 
based on interior point methods can handle only relatively small problems. Thus, a considerable 
gap exists between the theory and applicability of SDP formulations. Scalable algorithms for 
semidefinite programming, and closely related families of nonconvex programs more generally, 
are greatly needed. 

A parallel development is the surprising effectiveness of simple classical procedures such as 
gradient descent for large scale problems, as explored in the recent machine learning literature. In 
many areas of machine learning and signal processing such as classification, deep learning, and 
phase retrieval, gradient descent methods, in particular first order stochastic optimization, have led 
to remarkably efficient algorithms that can attack very large scale problems [3, 2, 10, 6]. In this 
paper we build on this work to develop first-order algorithms for solving the rank minimization 
problem under random measurements and a closely related family of semidefinite programs. Our 
algorithms are efficient and scalable, and we prove that they attain linear convergence to the global 
optimum under natural assumptions. 

The affine rank minimization problem is to find a matrix $X^\star \in \mathbb{R}^{n \times p}$ of minimum rank
satisfying constraints $\mathcal{A}(X^\star) = b$, where $\mathcal{A} : \mathbb{R}^{n \times p} \to \mathbb{R}^m$ is an affine transformation.
The underdetermined case where $m \ll np$ is of particular interest, and can be formulated as the
optimization

\[
\min_{X \in \mathbb{R}^{n \times p}} \operatorname{rank}(X) \quad \text{subject to} \quad \mathcal{A}(X) = b. \tag{1}
\]


This problem is a direct generalization of compressed sensing, and subsumes many machine learning
problems such as image compression, low rank matrix completion and low-dimensional metric
embedding [18, 12]. While the problem is natural and has many applications, the optimization is
nonconvex and challenging to solve. Without conditions on the transformation $\mathcal{A}$ or the minimum
rank solution $X^\star$, it is generally NP hard [15].

Existing methods, such as nuclear norm relaxation [18], singular value projection (SVP) [11],
and alternating least squares (AltMinSense) [12], assume that a certain restricted isometry
property (RIP) holds for $\mathcal{A}$. In the random measurement setting, this essentially means that at
least $O(r(n + p) \log(n + p))$ measurements are available, where $r = \operatorname{rank}(X^\star)$ [18]. In this
work, we assume that (i) $X^\star$ is positive semidefinite and (ii) $\mathcal{A} : \mathbb{R}^{n \times n} \to \mathbb{R}^m$ is defined as
$\mathcal{A}(X)_i = \operatorname{tr}(A_i X)$, where each $A_i$ is a random $n \times n$ symmetric matrix from the Gaussian
Orthogonal Ensemble (GOE), with $(A_i)_{jj} \sim \mathcal{N}(0, 2)$ and $(A_i)_{jk} \sim \mathcal{N}(0, 1)$ for $j \neq k$. Our goal is thus to
solve the optimization

\[
\min_{X \succeq 0} \operatorname{rank}(X) \quad \text{subject to} \quad \operatorname{tr}(A_i X) = b_i, \quad i = 1, \ldots, m. \tag{2}
\]
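The measurement model above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper names `sample_goe` and `measure` are hypothetical and not from the paper, and the sizes are arbitrary small values chosen for readability.

```python
import numpy as np

def sample_goe(m, n, rng):
    """Draw m symmetric n x n GOE matrices: N(0, 2) on the diagonal, N(0, 1) off it."""
    G = rng.standard_normal((m, n, n))
    # (G + G^T) / sqrt(2) has variance 1 off the diagonal and variance 2 on it.
    return (G + G.transpose(0, 2, 1)) / np.sqrt(2.0)

def measure(As, X):
    """Apply the linear map A(X)_i = tr(A_i X) to every matrix in the stack As."""
    return np.einsum("ijk,kj->i", As, X)

rng = np.random.default_rng(0)
n, r, m = 6, 2, 30                 # illustrative sizes (assumption)
Z_star = rng.standard_normal((n, r))
X_star = Z_star @ Z_star.T         # rank-r positive semidefinite ground truth
As = sample_goe(m, n, rng)
b = measure(As, X_star)            # b_i = tr(A_i X_star)
```

Only the vector `b` and the matrices `As` are visible to a recovery method; `X_star` is kept here solely to generate the data.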


In addition to the wide applicability of affine rank minimization, the problem is also closely
connected to a class of semidefinite programs. In Section 2, we show that the minimizer of a particular
class of SDP can be obtained by a linear transformation of $X^\star$. Thus, efficient algorithms for
problem (2) can be applied in this setting as well.

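Noting that a rank-$r$ solution $X^\star$ to (2) can be decomposed as $X^\star = Z^\star Z^{\star\top}$ where $Z^\star \in \mathbb{R}^{n \times r}$,
our approach is based on minimizing the squared residual

\[
f(Z) = \frac{1}{4m} \left\| \mathcal{A}(ZZ^\top) - b \right\|^2 = \frac{1}{4m} \sum_{i=1}^{m} \left( \operatorname{tr}(Z^\top A_i Z) - b_i \right)^2 .
\]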


While this is a nonconvex function, we take motivation from recent work for phase retrieval by
Candes et al. [6], and develop a gradient descent algorithm for optimizing $f(Z)$, using a carefully
constructed initialization and step size. Our main contributions concerning this algorithm are as
follows.

• We prove that with $O(r^3 n \log n)$ constraints our gradient descent scheme can exactly recover
$X^\star$ with high probability. Empirical experiments show that this bound may potentially be
improved to $O(rn \log n)$.

• We show that our method converges linearly, and has lower computational cost compared
with previous methods.

• We carry out a detailed comparison of rank minimization algorithms, and demonstrate that
when the measurement matrices $A_i$ are sparse, our gradient method significantly outperforms
alternative approaches.

In Section 3 we briefly review related work. In Section 4 we discuss the gradient scheme in 
detail. Our main analytical results are presented in Section 5, with detailed proofs contained in 
the supplementary material. Our experimental results are presented in Section 6, and we conclude 
with a brief discussion of future work in Section 7. 


2 Semidefinite Programming and Rank Minimization 


Before reviewing related work and presenting our algorithm, we pause to explain the connection 
between semidefinite programming and rank minimization. This connection enables our scalable 
gradient descent algorithm to be applied and analyzed for certain classes of SDPs. 

Consider a standard form semidefinite program 


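\[
\min_{X \succeq 0} \operatorname{tr}(CX) \quad \text{subject to} \quad \operatorname{tr}(A_i X) = b_i, \quad i = 1, \ldots, m \tag{3}
\]

where $C, A_1, \ldots, A_m$ are symmetric $n \times n$ matrices. If $C$ is positive definite, then we can write
$C = LL^\top$ where $L \in \mathbb{R}^{n \times n}$ is invertible. It follows that the minimum of problem (3) is the same as

\[
\min_{X \succeq 0} \operatorname{tr}(X) \quad \text{subject to} \quad \operatorname{tr}(\tilde{A}_i X) = b_i, \quad i = 1, \ldots, m \tag{4}
\]

where $\tilde{A}_i = L^{-1} A_i L^{-\top}$. In particular, minimizers $X^\star$ of (3) are obtained from minimizers $\tilde{X}^\star$ of
(4) via the transformation

\[
X^\star = L^{-\top} \tilde{X}^\star L^{-1}.
\]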
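The change of variables between problems (3) and (4) can be checked numerically. The sketch below is an illustration only: it assumes the factor $L$ is taken to be the Cholesky factor of $C$, and the function names `to_trace_form` and `recover_minimizer` are hypothetical. It verifies that the objective and the constraint values agree under the transformation, not that any SDP is solved.

```python
import numpy as np

def to_trace_form(C, As):
    """Given positive definite C = L L^T, return L and the matrices L^{-1} A_i L^{-T}."""
    L = np.linalg.cholesky(C)                  # lower triangular, C = L @ L.T
    L_inv = np.linalg.inv(L)
    As_tilde = np.array([L_inv @ A @ L_inv.T for A in As])
    return L, As_tilde

def recover_minimizer(L, X_tilde):
    """Map a point of problem (4) back to problem (3): X = L^{-T} X_tilde L^{-1}."""
    L_inv = np.linalg.inv(L)
    return L_inv.T @ X_tilde @ L_inv

rng = np.random.default_rng(1)
n, m = 5, 4                                    # illustrative sizes (assumption)
B = rng.standard_normal((n, n))
C = B @ B.T + n * np.eye(n)                    # a positive definite cost matrix
G = rng.standard_normal((m, n, n))
As = (G + G.transpose(0, 2, 1)) / 2.0          # symmetric constraint matrices
L, As_tilde = to_trace_form(C, As)

Y = rng.standard_normal((n, 2))
X_tilde = Y @ Y.T                              # any PSD point of problem (4)
X = recover_minimizer(L, X_tilde)              # corresponding point of problem (3)
```

Because the map preserves both the objective value and every constraint value, feasible points and minimizers of the two problems correspond one to one.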

Since $X$ is positive semidefinite, $\operatorname{tr}(X)$ is equal to the nuclear norm $\|X\|_*$. Hence, problem (4) is the nuclear norm
relaxation of problem (2). Next, we characterize the specific cases where $\tilde{X}^\star = X^\star$, so that the
SDP and rank minimization solutions coincide. The following result is from Recht et al. [18].

Theorem 1. Let $\mathcal{A} : \mathbb{R}^{n \times n} \to \mathbb{R}^m$ be a linear map. For every integer $k$ with $1 \le k \le n$, define
the $k$-restricted isometry constant to be the smallest value $\delta_k$ such that

\[
(1 - \delta_k) \|X\|_F \le \|\mathcal{A}(X)\| \le (1 + \delta_k) \|X\|_F
\]

holds for any matrix $X$ of rank at most $k$. Suppose that there exists a rank $r$ matrix $X^\star$ such
that $\mathcal{A}(X^\star) = b$. If $\delta_{2r} < 1$, then $X^\star$ is the only matrix of rank at most $r$ satisfying $\mathcal{A}(X) = b$.
Furthermore, if $\delta_{5r} < 1/10$, then $X^\star$ can be attained by minimizing $\|X\|_*$ over the affine subset.
In other words, since $\delta_{2r} \le \delta_{5r}$, if $\delta_{5r} < 1/10$ holds for the transformation $\mathcal{A}$ and one finds a
matrix $X$ of rank $r$ satisfying the affine constraint, then $X$ must be positive semidefinite. Hence,
one can ignore the semidefinite constraint $X \succeq 0$ when solving the rank minimization (2). The
resulting problem then can be exactly solved by nuclear norm relaxation. Since the minimum rank 
solution is positive semidefinite, it then coincides with the solution of the SDP (4), which is a 
constrained nuclear norm optimization. 

The observation that one can ignore the semidefinite constraint justifies our experimental
comparison with methods such as nuclear norm relaxation, SVP, and AltMinSense, described in the
following section.


3 Related Work 


Burer and Monteiro [4] proposed a general approach for solving semidefinite programs using
factored, nonconvex optimization, giving mostly experimental support for the convergence of the
algorithms. The first nontrivial guarantee for solving the affine rank minimization problem is given by
Recht et al. [18], based on replacing the rank function by the convex surrogate nuclear norm, as 
already mentioned in the previous section. While this is a convex problem, solving it in practice is 
nontrivial, and a variety of methods have been developed for efficient nuclear norm minimization. 
The most popular algorithms are proximal methods that perform singular value thresholding [5] 
at every iteration. While effective for small problem instances, the computational expense of the 
SVD prevents the method from being useful for large scale problems. 

Recently, Jain et al. [11] proposed a projected gradient descent algorithm SVP (Singular Value
Projection) that solves

\[
\min_{X \in \mathbb{R}^{n \times p}} \|\mathcal{A}(X) - b\|^2 \quad \text{subject to} \quad \operatorname{rank}(X) \le r,
\]

where $\|\cdot\|$ is the $\ell_2$ vector norm and $r$ is the input rank. In the $(t + 1)$th iteration, SVP updates
$X^{t+1}$ as the best rank $r$ approximation to the gradient update $X^t - \mu \mathcal{A}^\top(\mathcal{A}(X^t) - b)$, which is constructed
from the SVD. If $\operatorname{rank}(X^\star) = r$, then SVP can recover $X^\star$ under a similar RIP condition as the
nuclear norm heuristic, and enjoys a linear numerical rate of convergence. Yet SVP suffers from
the expensive per-iteration SVD for large problem instances.
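A single SVP iteration as described above can be sketched as follows. This is a minimal illustration, not the implementation of Jain et al.: the function names are hypothetical, a full SVD is used for the rank-$r$ projection for simplicity, and the step size `mu` is left to the caller.

```python
import numpy as np

def measure(As, X):
    """A(X)_i = tr(A_i X)."""
    return np.einsum("ijk,kj->i", As, X)

def adjoint(As, y):
    """Adjoint map A^T(y) = sum_i y_i A_i^T."""
    return np.einsum("i,ikj->jk", y, As)

def best_rank_r(X, r):
    """Best rank-r approximation of X in Frobenius norm via a truncated SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

def svp_step(X, As, b, r, mu):
    """One SVP update: gradient step on the residual, then projection to rank r."""
    residual = measure(As, X) - b
    return best_rank_r(X - mu * adjoint(As, residual), r)
```

The SVD inside `best_rank_r` is the per-iteration cost that the text identifies as the bottleneck for large instances.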

Subsequent work of Jain et al. [12] proposes an alternating least squares algorithm AltMinSense
that avoids the per-iteration SVD. AltMinSense factorizes $X$ into two factors $U \in \mathbb{R}^{n \times r}$, $V \in \mathbb{R}^{p \times r}$
such that $X = UV^\top$ and minimizes the squared residual $\|\mathcal{A}(UV^\top) - b\|^2$ by updating $U$ and
$V$ alternately. Each update is a least squares problem. The authors show that the iterates obtained
by AltMinSense converge to $X^\star$ linearly under a RIP condition. However, because the least squares
problems are often ill-conditioned, it is difficult to observe AltMinSense converging to $X^\star$ in
practice.

As described above, considerable progress has been made on algorithms for rank minimization 
and certain semidefinite programming problems. Yet truly efficient, scalable and provably
convergent algorithms have not yet been obtained. In the specific setting that $X^\star$ is positive semidefinite,
our algorithm exploits this structure to achieve these goals. We note that recent and independent
work of Tu et al. [21] proposes a hybrid algorithm called Procrustes Flow (PF), which uses a few
iterations of SVP as initialization, and then applies gradient descent.


4 A Gradient Descent Algorithm for Rank Minimization 

Our method is described in Algorithm 1. It is parallel to the Wirtinger Flow (WF) algorithm for
phase retrieval [6], to recover a complex vector $x \in \mathbb{C}^n$ given the squared magnitudes of its linear
measurements $b_i = |\langle a_i, x \rangle|^2$, $i \in [m]$, where $a_1, \ldots, a_m \in \mathbb{C}^n$. Candes et al. [6] propose a
first-order method to minimize the sum of squared residuals

\[
f_{\mathrm{WF}}(z) = \frac{1}{4m} \sum_{i=1}^{m} \left( |\langle a_i, z \rangle|^2 - b_i \right)^2 . \tag{5}
\]

The authors establish the convergence of WF to the global optimum: given sufficient
measurements, the iterates of WF converge linearly to $x$ up to a global phase, with high probability.

If $z$ and the $a_i$'s are real-valued, the function $f_{\mathrm{WF}}(z)$ can be expressed as

\[
f_{\mathrm{WF}}(z) = \frac{1}{4m} \sum_{i=1}^{m} \left( z^\top a_i a_i^\top z - x^\top a_i a_i^\top x \right)^2 ,
\]

which is a special case of $f(Z)$ where $A_i = a_i a_i^\top$ and each of $Z$ and $X^\star$ are rank one. See Figure 1a
for an illustration; Figure 1b shows the convergence rate of our method. Our methods and results
are thus generalizations of Wirtinger flow for phase retrieval.

Before turning to the presentation of our technical results in the following section, we present 
some intuition and remarks about how and why this algorithm works. For simplicity, let us assume 
that the rank is specified correctly. 

Initialization is of course crucial in nonconvex optimization, as many local minima may be 
present. To obtain a sufficiently accurate initialization, we use a spectral method, similar to those 
used in [17, 6]. The starting point is the observation that a linear combination of the constraint 
values and matrices yields an unbiased estimate of the solution. 

Lemma 1. Let $M = \frac{1}{m} \sum_{i=1}^{m} b_i A_i$. Then $\mathbb{E}\left(\frac{1}{2} M\right) = X^\star$, where the expectation is with respect to
the randomness in the measurement matrices $A_i$.

Based on this fact, let $X^\star = U^\star \Sigma U^{\star\top}$ be the eigenvalue decomposition of $X^\star$, where $U^\star =
[u_1^\star, \ldots, u_r^\star]$ and $\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_r)$ such that $\sigma_1 \ge \cdots \ge \sigma_r$ are the nonzero eigenvalues of
$X^\star$. Let $Z^\star = U^\star \Sigma^{1/2}$. Clearly, $u_s^\star = z_s^\star / \|z_s^\star\|$ is the top $s$th eigenvector of $\mathbb{E}(M)$ associated with
eigenvalue $2 \|z_s^\star\|^2$. Therefore, we initialize according to $z_s^0 = \sqrt{|\lambda_s| / 2}\, v_s$ where $(v_s, \lambda_s)$ is the top
$s$th eigenpair of $M$. For sufficiently large $m$, it is reasonable to expect that $Z^0$ is close to $Z^\star$; this
is confirmed by concentration of measure arguments.
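The spectral initialization just described can be sketched as below. This is an illustration under assumptions: the name `spectral_init` is hypothetical, a dense eigendecomposition is used instead of a partial one, and the problem sizes are arbitrary.

```python
import numpy as np

def spectral_init(As, b, r):
    """Return Z0 with columns sqrt(|lambda_s| / 2) * v_s from the top r eigenpairs of M."""
    m = len(b)
    M = np.einsum("i,ijk->jk", b, As) / m        # M = (1/m) sum_i b_i A_i
    lam, V = np.linalg.eigh(M)                   # M is symmetric
    top = np.argsort(-np.abs(lam))[:r]           # r eigenvalues largest in magnitude
    Z0 = V[:, top] * np.sqrt(np.abs(lam[top]) / 2.0)
    return Z0, lam[top]

rng = np.random.default_rng(0)
n, r, m = 8, 2, 2000                             # illustrative sizes (assumption)
Z_star = rng.standard_normal((n, r))
X_star = Z_star @ Z_star.T
G = rng.standard_normal((m, n, n))
As = (G + G.transpose(0, 2, 1)) / np.sqrt(2.0)   # GOE measurement matrices
b = np.einsum("ijk,kj->i", As, X_star)
Z0, lam = spectral_init(As, b, r)
init_error = np.linalg.norm(Z0 @ Z0.T - X_star) / np.linalg.norm(X_star)
```

Comparing `Z0 @ Z0.T` with `X_star` sidesteps the rotation ambiguity of the factor; as $m$ grows, `init_error` is expected to shrink, in line with Lemma 1.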

Certain key properties of $f(Z)$ will be seen to yield a linear rate of convergence. In the analysis
of convex functions, Nesterov [16] shows that for unconstrained optimization, the gradient descent
scheme with sufficiently small step size will converge linearly to the optimum if the objective
function is strongly convex and has a Lipschitz continuous gradient. However, these two properties
are global and do not hold for our objective function $f(Z)$. Nevertheless, we expect that similar
conditions hold for the local area near $Z^\star$. If so, then if we start close enough to $Z^\star$, we can achieve
the global optimum.

Figure 1: (a) An instance of $f(Z)$ where $X^\star \in \mathbb{R}^{2 \times 2}$ is rank-1 and $Z \in \mathbb{R}^2$. The underlying truth
is $Z^\star = [1, 1]^\top$. Both $Z^\star$ and $-Z^\star$ are minimizers. (b) Linear convergence of the gradient scheme,
for $n = 200$, $m = 1000$ and $r = 2$. The distance metric is given in Definition 1.

In our subsequent analysis, we establish the convergence of Algorithm 1 with a constant step
size of the form $\mu / \|Z^\star\|_F^2$, where $\mu$ is a small constant. Since $\|Z^\star\|_F^2$ is unknown, we replace it by
$\|Z^0\|_F^2$.

5 Convergence Analysis 

In this section we present our main result analyzing the gradient descent algorithm, and give a 
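sketch of the proof. To begin, note that the symmetric decomposition of $X^\star$ is not unique, since
$X^\star = (Z^\star U)(Z^\star U)^\top$ for any $r \times r$ orthonormal matrix $U$. Thus, the solution set is

\[
\mathcal{S} = \left\{ \tilde{Z} \in \mathbb{R}^{n \times r} \;\middle|\; \tilde{Z} = Z^\star U \text{ for some } U \text{ with } UU^\top = U^\top U = I \right\} .
\]

Note that $\|\tilde{Z}\|_F^2 = \|X^\star\|_*$ for any $\tilde{Z} \in \mathcal{S}$. We define the distance to the optimal solution in terms
of this set.

Definition 1. Define the distance between $Z$ and $Z^\star$ as

\[
d(Z, Z^\star) = \min_{UU^\top = U^\top U = I} \|Z - Z^\star U\|_F = \min_{\tilde{Z} \in \mathcal{S}} \|Z - \tilde{Z}\|_F .
\]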
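The distance in Definition 1 is an orthogonal Procrustes problem, so it can be evaluated in closed form from one small SVD. The sketch below is illustrative; the function name `procrustes_distance` is hypothetical. The optimal rotation is $U = PQ^\top$, where $PSQ^\top$ is the SVD of $Z^{\star\top} Z$.

```python
import numpy as np

def procrustes_distance(Z, Z_star):
    """d(Z, Z*) = min over orthogonal U of ||Z - Z* U||_F."""
    P, _, Qt = np.linalg.svd(Z_star.T @ Z)   # r x r SVD
    U = P @ Qt                               # maximizer of <U, Z*^T Z> over orthogonal U
    return np.linalg.norm(Z - Z_star @ U, "fro")

rng = np.random.default_rng(0)
n, r = 7, 3                                  # illustrative sizes (assumption)
Z_star = rng.standard_normal((n, r))
R, _ = np.linalg.qr(rng.standard_normal((r, r)))   # a random orthogonal matrix
d_rotated = procrustes_distance(Z_star @ R, Z_star)  # close to zero up to rounding
```

Any rotated copy `Z_star @ R` lies in the solution set, so its distance to `Z_star` vanishes even though the two factors differ entrywise.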
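Algorithm 1: Gradient descent for rank minimization

input: $\{A_i, b_i\}_{i=1}^{m}$, $r$, $\mu$

initialization

    Set $(v_1, \lambda_1), \ldots, (v_r, \lambda_r)$ to the top $r$ eigenpairs of $\frac{1}{m} \sum_{i=1}^{m} b_i A_i$ such that $|\lambda_1| \ge \cdots \ge |\lambda_r|$

    $Z^0 = [z_1^0, \ldots, z_r^0]$ where $z_s^0 = \sqrt{|\lambda_s| / 2} \cdot v_s$, $s \in [r]$

    $k \leftarrow 0$

repeat

    $\nabla f(Z^k) = \frac{1}{m} \sum_{i=1}^{m} \left( \operatorname{tr}(Z^{k\top} A_i Z^k) - b_i \right) A_i Z^k$

    $Z^{k+1} = Z^k - \dfrac{\mu}{\sum_{i=1}^{r} |\lambda_i| / 2} \nabla f(Z^k)$

    $k \leftarrow k + 1$

until convergence;

output: $\widehat{X} = Z^k Z^{k\top}$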
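Algorithm 1 can be sketched end to end in NumPy as follows. This is an illustrative reading of the pseudocode under assumptions, not the authors' MATLAB implementation: the function names are hypothetical, a fixed iteration count stands in for the "until convergence" test, and a dense eigendecomposition replaces a partial one.

```python
import numpy as np

def spectral_init(As, b, r):
    """Top r eigenpairs (by magnitude) of M = (1/m) sum_i b_i A_i, scaled by sqrt(|lambda|/2)."""
    M = np.einsum("i,ijk->jk", b, As) / len(b)
    lam, V = np.linalg.eigh(M)
    top = np.argsort(-np.abs(lam))[:r]
    return V[:, top] * np.sqrt(np.abs(lam[top]) / 2.0), lam[top]

def gradient(Z, As, b):
    """grad f(Z) = (1/m) sum_i (tr(Z^T A_i Z) - b_i) A_i Z for symmetric A_i."""
    AZ = As @ Z                                        # stack of A_i Z, shape (m, n, r)
    residual = np.einsum("js,ijs->i", Z, AZ) - b       # tr(Z^T A_i Z) - b_i
    return np.einsum("i,ijs->js", residual, AZ) / len(b)

def rank_min_gradient_descent(As, b, r, mu, num_iters):
    """Spectral initialization followed by constant-step gradient descent on f."""
    Z, lam = spectral_init(As, b, r)
    step = mu / np.sum(np.abs(lam) / 2.0)              # mu / ||Z^0||_F^2
    for _ in range(num_iters):
        Z = Z - step * gradient(Z, As, b)
    return Z @ Z.T
```

A call such as `rank_min_gradient_descent(As, b, r=2, mu=0.3, num_iters=1000)` returns the estimate $Z^k Z^{k\top}$; the values of `mu` and `num_iters` here are placeholders that would need tuning for a given instance.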


Our main result for exact recovery is stated below, assuming that the rank is correctly
specified. Since the true rank is typically unknown in practice, one can start from a very low rank and
gradually increase it.

Theorem 2. Let the condition number $\kappa = \sigma_1 / \sigma_r$ denote the ratio of the largest to the smallest
nonzero eigenvalues of $X^\star$. There exists a universal constant $c_0$ such that if $m \ge c_0 \kappa^2 r^3 n \log n$,
with high probability the initialization $Z^0$ satisfies

\[
d(Z^0, Z^\star) \le \sqrt{\tfrac{3}{16} \sigma_r} . \tag{6}
\]

Moreover, there exists a universal constant $c_1$ such that when using constant step size $\mu / \|Z^\star\|_F^2$
with $\mu \le \frac{c_1}{\kappa n}$ and initial value $Z^0$ obeying (6), the $k$th step of Algorithm 1 satisfies

\[
d(Z^k, Z^\star) \le \sqrt{\tfrac{3}{16} \sigma_r} \left( 1 - \frac{\mu}{12 \kappa r} \right)^{k/2}
\]

with high probability.


We now outline the proof, giving full details in the supplementary material. The proof has four 
main steps. The first step is to give a regularity condition under which the algorithm converges 
linearly if we start close enough to Z*. This provides a local regularity property that is similar 
to the Nesterov [16] criteria that the objective function is strongly convex and has a Lipschitz 
continuous gradient. 

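Definition 2. Let $\tilde{Z} = \arg\min_{\tilde{Z} \in \mathcal{S}} \|Z - \tilde{Z}\|_F$ denote the matrix closest to $Z$ in the solution set.
We say that $f$ satisfies the regularity condition $RC(\varepsilon, \alpha, \beta)$ if there exist constants $\alpha, \beta$ such that
for any $Z$ satisfying $d(Z, Z^\star) \le \varepsilon$, we have

\[
\langle \nabla f(Z), Z - \tilde{Z} \rangle \ge \frac{1}{\alpha} \sigma_r \|Z - \tilde{Z}\|_F^2 + \frac{1}{\beta \|Z^\star\|_F^2} \|\nabla f(Z)\|_F^2 .
\]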
Using this regularity condition, we show that the iterative step of the algorithm moves closer
to the optimum, if the current iterate is sufficiently close.

Theorem 3. Consider the update $Z^{k+1} = Z^k - \frac{\mu}{\|Z^\star\|_F^2} \nabla f(Z^k)$. If $f$ satisfies $RC(\varepsilon, \alpha, \beta)$,
$d(Z^k, Z^\star) \le \varepsilon$, and $0 < \mu < \min(\alpha/2, 2/\beta)$, then

\[
d(Z^{k+1}, Z^\star) \le \sqrt{1 - \frac{2\mu}{\alpha \kappa r}} \; d(Z^k, Z^\star) .
\]

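In the next step of the proof, we condition on two events that will be shown to hold with high
probability using concentration results. Let $\delta$ denote a small value to be specified later.

A1 For any $u \in \mathbb{R}^n$ such that $\|u\| \le \sqrt{\sigma_1}$,

\[
\left\| \frac{1}{m} \sum_{i=1}^{m} (u^\top A_i u) A_i - 2uu^\top \right\| \le \frac{\delta}{r} .
\]

A2 For any $\tilde{Z} \in \mathcal{S}$,

\[
\left\| \frac{\partial^2 f(\tilde{Z})}{\partial \tilde{z}_s \partial \tilde{z}_k^\top} - \mathbb{E}\left[ \frac{\partial^2 f(\tilde{Z})}{\partial \tilde{z}_s \partial \tilde{z}_k^\top} \right] \right\| \le \frac{\delta}{r} \quad \text{for all } s, k \in [r] .
\]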


Here the expectations are with respect to the random measurement matrices. Under these
assumptions, we can show that the objective satisfies the regularity condition with high probability.

Theorem 4. Suppose that A1 and A2 hold. If $\delta \le \frac{1}{16} \sigma_r$, then $f$ satisfies the regularity condition
$RC\left(\sqrt{\tfrac{3}{16} \sigma_r},\, 24,\, 513 \kappa n\right)$ with probability at least $1 - mCe^{-\rho n}$, where $C, \rho$ are universal constants.

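Next we show that under A1, a good initialization can be found.

Theorem 5. Suppose that A1 holds. Let $\{(v_s, \lambda_s)\}_{s=1}^{r}$ be the top $r$ eigenpairs of $M = \frac{1}{m} \sum_{i=1}^{m} b_i A_i$
such that $|\lambda_1| \ge \cdots \ge |\lambda_r|$. Let $Z^0 = [z_1, \ldots, z_r]$ where $z_s = \sqrt{|\lambda_s| / 2} \cdot v_s$, $s \in [r]$. If $\delta \le \frac{\sigma_r}{4\sqrt{r}}$, then

\[
d(Z^0, Z^\star) \le \sqrt{3\sigma_r / 16} .
\]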


Finally, we show that conditioning on A1 and A2 is valid since these events have high
probability as long as $m$ is sufficiently large.


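Theorem 6. If the number of samples $m \ge \dfrac{42}{\min\left(\delta^2 / (r^2 \sigma_1^2),\ \delta / (r \sigma_1)\right)}\, n \log n$, then for any $u \in \mathbb{R}^n$
satisfying $\|u\| \le \sqrt{\sigma_1}$,

\[
\left\| \frac{1}{m} \sum_{i=1}^{m} (u^\top A_i u) A_i - 2uu^\top \right\| \le \frac{\delta}{r}
\]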

holds with probability at least 1 — mCe~^^ — where C and p are universal constants. 
Theorem 7. For any $x \in \mathbb{R}^n$, if $m \ge$


/Ar‘^a‘1, 5l2rai) 


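$n \log n$, then for any $\tilde{Z} \in \mathcal{S}$,

\[
\left\| \frac{\partial^2 f(\tilde{Z})}{\partial \tilde{z}_s \partial \tilde{z}_k^\top} - \mathbb{E}\left[ \frac{\partial^2 f(\tilde{Z})}{\partial \tilde{z}_s \partial \tilde{z}_k^\top} \right] \right\| \le \frac{\delta}{r}
\]

for all $s, k \in [r]$,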


with probability at least 1 — Qme 


Note that since we need $\delta \le \min\left( \frac{1}{16}, \frac{1}{4\sqrt{r}} \right) \sigma_r$, we have $\frac{\delta}{r \sigma_1} \le 1$, and the number of
measurements required by our algorithm scales as $O(r^3 \kappa^2 n \log n)$, while only $O(r^2 \kappa^2 n \log n)$ samples are
required by the regularity condition. We conjecture this bound could be further improved to be
$O(rn \log n)$; this is supported by the experimental results presented below.

Recently, Tu et al. [21] establish a tighter $O(r^2 \kappa^2 n)$ bound overall. Specifically, when only one
single SVP step is used in preprocessing, the initialization of PF is also the spectral decomposition
of $\frac{1}{2} M$. The authors show that $O(r^2 \kappa^2 n)$ measurements are sufficient for the initial solution to
satisfy $d(Z^0, Z^\star) \le O(\sqrt{\sigma_r})$ with high probability, and demonstrate an $O(rn)$ sample complexity
for the regularity condition.


6 Experiments 

In this section we report the results of experiments on synthetic datasets. We compare our gradient 
descent algorithm with nuclear norm relaxation, SVP and AltMinSense for which we drop the 
positive semidefiniteness constraint, as justified by the observation in Section 2. We use ADMM 
for the nuclear norm minimization, based on the algorithm for the mixture approach in Tomioka 
et al. [19]; see Appendix G. For simplicity, we assume that AltMinSense, SVP and the gradient 
scheme know the true rank. Krylov subspace techniques such as the Lanczos method could be
used to compute the partial eigendecomposition; we use the randomized algorithm of Halko et al. [9]
to compute the low rank SVD. All methods are implemented in MATLAB and the experiments 
were run on a MacBook Pro with a 2.5GHz Intel Core i7 processor and 16 GB memory. 

6.1 Computational Complexity 

It is instructive to compare the per-iteration cost of the different approaches; see Table 1. Suppose
that the density (fraction of nonzero entries) of each $A_i$ is $p$. For AltMinSense, the cost of
solving the least squares problem is $O(mn^2 r^2 + n^3 r^3 + mn^2 r p)$. The other three methods have
$O(mn^2 p)$ cost to compute the affine transformation. For the nuclear norm approach, the $O(n^3)$
cost is from the SVD and the $O(m^2)$ cost is due to the update of the dual variables. The gradient
scheme requires $2n^2 r$ operations to compute $Z^k Z^{k\top}$ and to multiply $Z^k$ by an $n \times n$ matrix to obtain
the gradient. SVP needs $O(n^2 r)$ operations to compute the top $r$ singular vectors. However, in
practice this partial SVD is more expensive than the $2n^2 r$ cost required for the matrix multiplies in
the gradient scheme.

Method                                   Complexity

nuclear norm minimization via ADMM       $O(mn^2 p + m^2 + n^3)$
gradient descent                         $O(mn^2 p) + 2n^2 r$
SVP                                      $O(mn^2 p + n^2 r)$
AltMinSense                              $O(mn^2 r^2 + n^3 r^3 + mn^2 r p)$

Table 1: Per-iteration computational complexities of different methods.


Clearly, AltMinSense is the least efficient. For the other approaches, in the dense case (p 
large), the affine transformation dominates the computation. Our method removes the overhead 
caused by the SVD. In the sparse case (p small), the other parts dominate and our method enjoys a 
low cost. 

6.2 Runtime Comparison 

We conduct experiments for both dense and sparse measurement matrices. AltMinSense is 
indeed slow, so we do not include it here. 

In the first scenario, we randomly generate a $400 \times 400$ rank-2 matrix $X^\star = xx^\top + yy^\top$
where $x, y \sim \mathcal{N}(0, I)$. We also generate $m = 6n$ matrices $A_1, \ldots, A_m$ from the GOE, and
then take $b = \mathcal{A}(X^\star)$. We report the relative error measured in the Frobenius norm defined as
$\|\widehat{X} - X^\star\|_F / \|X^\star\|_F$. For the nuclear norm approach, we set the regularization parameter to $\lambda =$
10“^. We test three values p = 10,100,200 for the penalty parameter and select p = 100 as it leads 
to the fastest convergence. Similarly, for SVP we evaluate the three values 5 x 10“®, 10“"^, 2 x 10“^ 
for the step size, and select 10“^ as the largest for which SVP converges. For our approach, we test 
the three values 0.6, 0.8, 1.0 for $\mu$ and select 0.8 in the same way.

In the second scenario, we use a more general and practical setting. We randomly generate
a rank-2 matrix $X^\star \in \mathbb{R}^{600 \times 600}$ as before. We generate $m = 7n$ sparse $A_i$'s whose entries are
i.i.d. Bernoulli:

\[
(A_i)_{jk} = \begin{cases} 1 & \text{with probability } p, \\ 0 & \text{with probability } 1 - p, \end{cases}
\]

where we use p = 0.001. For all the methods we use the same strategies as before to select 
parameters. For the nuclear norm approach, we try three values p = 10,100, 200 and select p = 
100. For SVP, we test the three values 5 x 10“^, 2 x 10“^, 10“^ for the step size and select 10“^. 
For the gradient algorithm, we check the three values 0.8, 1, 1.5 for $\mu$ and choose 1.
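The sparse measurement model of the second scenario, together with the relative error metric, can be sketched as follows. This is a small illustration under assumptions: the function names are hypothetical, each matrix is stored only through the indices of its unit entries, and the sizes are far smaller than the $600 \times 600$ setting reported above.

```python
import numpy as np

def sample_sparse_bernoulli(m, n, p, rng):
    """For each of m matrices, return the (row, col) indices of its entries equal to 1."""
    supports = []
    for _ in range(m):
        mask = rng.random((n, n)) < p          # each entry is 1 with probability p
        supports.append(np.nonzero(mask))
    return supports

def measure_sparse(supports, X):
    """tr(A_i X) = sum of X[k, j] over the nonzero positions (j, k) of A_i."""
    return np.array([X[cols, rows].sum() for rows, cols in supports])

def relative_error(X_hat, X_star):
    """Relative error in Frobenius norm."""
    return np.linalg.norm(X_hat - X_star, "fro") / np.linalg.norm(X_star, "fro")

rng = np.random.default_rng(0)
n, m, p = 50, 20, 0.01                         # illustrative sizes (assumption)
x, y = rng.standard_normal(n), rng.standard_normal(n)
X_star = np.outer(x, x) + np.outer(y, y)       # rank-2 ground truth
supports = sample_sparse_bernoulli(m, n, p, rng)
b = measure_sparse(supports, X_star)
```

Each measurement touches only the nonzero entries of $A_i$, which is the source of the low cost in the sparse regime discussed in Section 6.1.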


The results are shown in Figures 2a and 2b. In the dense case, our method is faster than the 


nuclear norm approach and slightly outperforms SVP. In the sparse case, it is significantly faster 
than the other approaches. 
Figure 2: (a) Runtime comparison where $X^\star \in \mathbb{R}^{400 \times 400}$ is rank-2 and $A_i$'s are dense. (b) Runtime
comparison where $X^\star \in \mathbb{R}^{600 \times 600}$ is rank-2 and $A_i$'s are sparse. (c) Sample complexity comparison.

6.3 Sample Complexity 

We also evaluate the number of measurements required by each method to exactly recover X*, 
which we refer to as the sample complexity. We randomly generate the true matrix $X^\star \in \mathbb{R}^{n \times n}$
and compute the solutions of each method given $m$ measurements, where the $A_i$'s are randomly
drawn from the GOE. A solution with relative error below 10“^ is considered to be successful. We 
run 40 trials and compute the empirical probability of successful recovery. 

We consider cases where n = 60 or 100 and X* is of rank one or two. The results are shown 
in Figure 2c. For SVP and our approach, the phase transitions happen around m = 1.5n when 
X* is rank-1 and m = 2.5n when X* is rank-2. This scaling is close to the number of degrees 
of freedom in each case; this confirms that the sample complexity scales linearly with the rank 
r. The phase transition for the nuclear norm approach occurs later. The results suggest that the 
sample complexity of our method should also scale as $O(rn \log n)$ as for SVP and the nuclear
norm approach [11, 18]. 


7 Conclusion 

We connect a special case of affine rank minimization to a class of semidefinite programs with 
random constraints. Building on a recently proposed first-order algorithm for phase retrieval [6], 
we develop a gradient descent procedure for rank minimization and establish convergence to the 
optimal solution with $O(r^3 n \log n)$ measurements. We conjecture that $O(rn \log n)$ measurements
are sufficient for the method to converge, and that the conditions on the sampling matrices $A_i$
can be significantly weakened. More broadly, the technique used in this paper—factoring the 
semidefinite matrix variable, recasting the convex optimization as a nonconvex optimization, and 
applying first-order algorithms—first proposed by Burer and Monteiro [4], may be effective for a 
much wider class of SDPs, and deserves further study. 
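A Proof of Lemma 1

Let $A = (a_{ij})$ be a random matrix that is GOE distributed; thus $a_{ij} \sim \mathcal{N}(0, 1)$ for $i \neq j$ and $a_{ii} \sim
\mathcal{N}(0, 2)$. We have $\mathbb{E}(M) = \mathbb{E}\left( \operatorname{tr}(AX^\star) A \right) = \sum_{s=1}^{r} \mathbb{E}\left( (z_s^{\star\top} A z_s^\star) A \right)$. Hence, it suffices to show that
$\mathbb{E}\left( (x^\top A x) A \right) = 2xx^\top$ for any $x \in \mathbb{R}^n$. The $(i, j)$ entry of $(x^\top A x) A$ has expected value

\[
\begin{aligned}
\mathbb{E}\left( (x^\top A x)\, a_{ij} \right)
&= \mathbb{E}\Big( \sum_k \sum_l x_k x_l a_{kl} a_{ij} \Big) \\
&= \sum_k \sum_l x_k x_l \, \mathbb{E}(a_{kl} a_{ij}) \\
&= \sum_k \sum_l x_k x_l \cdot
\begin{cases} 0 & \text{if } (k, l) \neq (i, j) \wedge (k, l) \neq (j, i) \\ \mathbb{E}(a_{ij}^2) & \text{otherwise} \end{cases} \\
&= \begin{cases} 2 x_i x_j \, \mathbb{E}(a_{ij}^2) & \text{if } i \neq j \\ x_i^2 \, \mathbb{E}(a_{ii}^2) & \text{otherwise} \end{cases} \\
&= \begin{cases} 2 x_i x_j & \text{if } i \neq j, \\ 2 x_i^2 & \text{otherwise,} \end{cases}
\end{aligned}
\]

where we use that the variance of $a_{ii}$ is 2 and the variance of $a_{ij}$ is 1 for any $i \neq j$. In matrix form,
this is $\mathbb{E}\left( (x^\top A x) A \right) = 2xx^\top$.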

B Ingredients 

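We first present some technical lemmas that will be needed later. Recall from Definition 2 that for any
$Z$, $\tilde{Z} = \arg\min_{\tilde{Z} \in \mathcal{S}} \|Z - \tilde{Z}\|_F$. Let $H = Z - \tilde{Z}$. The $s$th columns of $Z$, $\tilde{Z}$, $Z^\star$, $H$ are denoted by
$z_s$, $\tilde{z}_s$, $z_s^\star$, $h_s$ respectively. We shall use the following formulas for the gradient and second order
partial derivatives:

\[
\nabla f(Z) = \frac{1}{m} \sum_{i=1}^{m} \left( \operatorname{tr}(H^\top A_i H) + 2 \operatorname{tr}(\tilde{Z}^\top A_i H) \right) (A_i H + A_i \tilde{Z}),
\]

\[
\frac{\partial^2 f(Z)}{\partial z_s \partial z_s^\top} = \frac{1}{m} \sum_{i=1}^{m} \left( 2 A_i z_s z_s^\top A_i + \left( \operatorname{tr}(Z^\top A_i Z) - b_i \right) A_i \right), \quad \forall s \in [r],
\]

\[
\frac{\partial^2 f(Z)}{\partial z_s \partial z_k^\top} = \frac{1}{m} \sum_{i=1}^{m} 2 A_i z_s z_k^\top A_i, \quad \forall s, k \in [r] \text{ such that } s \neq k.
\]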
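The gradient formula can be sanity-checked against central finite differences. In the sketch below (illustrative; function names are hypothetical) the gradient is written in the equivalent form $\nabla f(Z) = \frac{1}{m} \sum_i (\operatorname{tr}(Z^\top A_i Z) - b_i) A_i Z$ used in Algorithm 1, which matches the expression in terms of $H$ because $b_i = \operatorname{tr}(\tilde{Z}^\top A_i \tilde{Z})$ and $Z = \tilde{Z} + H$.

```python
import numpy as np

def objective(Z, As, b):
    """f(Z) = (1 / 4m) sum_i (tr(Z^T A_i Z) - b_i)^2."""
    residual = np.einsum("js,ijs->i", Z, As @ Z) - b
    return np.sum(residual ** 2) / (4 * len(b))

def gradient(Z, As, b):
    """Analytic gradient (1/m) sum_i (tr(Z^T A_i Z) - b_i) A_i Z, valid for symmetric A_i."""
    AZ = As @ Z
    residual = np.einsum("js,ijs->i", Z, AZ) - b
    return np.einsum("i,ijs->js", residual, AZ) / len(b)

def numerical_gradient(Z, As, b, eps=1e-5):
    """Entrywise central differences of the objective."""
    G = np.zeros_like(Z)
    for idx in np.ndindex(*Z.shape):
        E = np.zeros_like(Z)
        E[idx] = eps
        G[idx] = (objective(Z + E, As, b) - objective(Z - E, As, b)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
n, r, m = 5, 2, 40                             # illustrative sizes (assumption)
W = rng.standard_normal((m, n, n))
As = (W + W.transpose(0, 2, 1)) / np.sqrt(2.0)
Z_star = rng.standard_normal((n, r))
b = np.einsum("ijk,kj->i", As, Z_star @ Z_star.T)
Z = Z_star + 0.3 * rng.standard_normal((n, r))
max_gap = np.max(np.abs(gradient(Z, As, b) - numerical_gradient(Z, As, b)))
```

A small `max_gap` relative to the size of the gradient indicates that the analytic expression and the objective are consistent.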

The next ingredient we need is the expectation of the second order partial derivatives with 
respect to the random measurement matrices. 

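Lemma 2. Let $A = (a_{ij})$ be a GOE distributed random matrix. For any two fixed vectors $x$ and $y$,
we have $\mathbb{E}[A x y^\top A] = x^\top y \, I + y x^\top$.

Proof. The expectation of the $(i, j)$ entry of $A x y^\top A$ is

\[
\mathbb{E}\left[ (A x y^\top A)_{ij} \right] = \mathbb{E}\Big( \sum_k \sum_l a_{ik} a_{lj} x_k y_l \Big) .
\]

If $i = j$, then we have

\[
\mathbb{E}\left[ (A x y^\top A)_{ii} \right] = \mathbb{E}\Big( \sum_k a_{ik}^2 x_k y_k \Big) = \sum_k x_k y_k + x_i y_i ,
\]

since $\operatorname{Var}(a_{ii}) = 2$ and $\operatorname{Var}(a_{ik}) = 1$ if $k \neq i$. On the other hand, if $i \neq j$, then

\[
\mathbb{E}\left[ (A x y^\top A)_{ij} \right] = \mathbb{E}\Big( \sum_{k, l} a_{ik} a_{lj} x_k y_l \Big) = \mathbb{E}(a_{ij}^2 x_j y_i) = x_j y_i .
\]

Therefore, $\mathbb{E}(A x y^\top A) = x^\top y \, I + y x^\top$. □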
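Lemma 2 lends itself to a quick Monte Carlo check. The sketch below is an illustration only: the function name and the sample count are assumptions, and agreement is only up to sampling error. It uses the symmetry of $A$ to write $A x y^\top A = (Ax)(Ay)^\top$.

```python
import numpy as np

def monte_carlo_lemma2(x, y, num_samples, rng):
    """Empirical average of A x y^T A over GOE samples A."""
    n = len(x)
    G = rng.standard_normal((num_samples, n, n))
    As = (G + G.transpose(0, 2, 1)) / np.sqrt(2.0)   # GOE samples
    Ax = As @ x                                      # shape (num_samples, n)
    Ay = As @ y
    # A x y^T A = (A x)(A y)^T because A is symmetric.
    return np.einsum("ij,ik->jk", Ax, Ay) / num_samples

rng = np.random.default_rng(0)
x = np.array([0.6, 0.8, 0.0])
y = np.array([0.0, 0.6, 0.8])
empirical = monte_carlo_lemma2(x, y, 200_000, rng)
expected = (x @ y) * np.eye(3) + np.outer(y, x)      # x^T y I + y x^T
max_gap = np.max(np.abs(empirical - expected))
```

The vectors are chosen so that $y x^\top$ and $x y^\top$ differ, which makes the order of the outer product in the lemma visible in the estimate.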

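Lemma 3. For all $s \in [r]$, it holds that

\[
\mathbb{E}\left[ \frac{\partial^2 f(Z)}{\partial z_s \partial z_s^\top} \right] = 2 \|z_s\|^2 I + 2 z_s z_s^\top + 2 ZZ^\top - 2 X^\star
\]

and

\[
\mathbb{E}\left[ \frac{\partial^2 f(Z)}{\partial z_s \partial z_k^\top} \right] = 2 z_s^\top z_k I + 2 z_k z_s^\top \quad \text{for all } k \in [r] \text{ such that } k \neq s,
\]

where the expectation is over the random measurement matrices.

Proof. The case where $k \neq s$ is a direct result of Lemma 2. For the other case, let $A = (a_{ij})$ be a
GOE distributed random matrix. It follows from Lemma 1 that

\[
\mathbb{E}\left[ \frac{\partial^2 f(Z)}{\partial z_s \partial z_s^\top} \right] = 2 \, \mathbb{E}(A z_s z_s^\top A) + 2 ZZ^\top - 2 X^\star .
\]

By Lemma 2, we have

\[
\mathbb{E}(A z_s z_s^\top A) = \|z_s\|^2 I + z_s z_s^\top .
\]

Substituting this back into the above equation, we obtain the lemma. □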


We next recall a concentration result for the operator (spectral) norm of the random
measurement matrices.


Lemma 4. (Ledoux and Rider [14, Theorem 1]) There exists two absolute constants C and p = 
such that with probability at least 1 — 

\[
\|A_i\| \le 3\sqrt{n} .
\]

A tighter upper bound is actually given by the Tracy-Widom law: w.h.p. $\|A_i\| = O(2\sqrt{n} + n^{-1/6})$.

Corollary 1. With probability at least $1 - mCe^{-\rho n}$, the average of the squared operator norm of
the random measurement matrices is upper bounded by $9n$.

Proof. Applying a union bound we have

\[
\begin{aligned}
\mathbb{P}\Big( \frac{1}{m} \sum_{i=1}^{m} \|A_i\|^2 \le 9n \Big)
&\ge \mathbb{P}\left( \forall i, \ \|A_i\| \le 3\sqrt{n} \right) \\
&\ge 1 - \sum_{i=1}^{m} \mathbb{P}\left( \|A_i\| > 3\sqrt{n} \right) \\
&\ge 1 - mCe^{-\rho n} ,
\end{aligned}
\]

where we use Lemma 4 in the last line. □
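The bound in Corollary 1 is easy to observe empirically. The following sketch (illustrative; the function name and sample sizes are assumptions) draws GOE matrices, computes their operator norms from the eigenvalues, and compares them with $3\sqrt{n}$.

```python
import numpy as np

def goe_operator_norms(m, n, rng):
    """Operator norms of m independent n x n GOE matrices."""
    G = rng.standard_normal((m, n, n))
    As = (G + G.transpose(0, 2, 1)) / np.sqrt(2.0)
    eigenvalues = np.linalg.eigvalsh(As)             # shape (m, n), batched
    # For a symmetric matrix the operator norm is the largest |eigenvalue|.
    return np.max(np.abs(eigenvalues), axis=1)

rng = np.random.default_rng(0)
n, m = 50, 200                                       # illustrative sizes (assumption)
norms = goe_operator_norms(m, n, rng)
all_below_bound = bool(np.all(norms <= 3 * np.sqrt(n)))
mean_squared_norm = float(np.mean(norms ** 2))       # to be compared with 9 * n
```

In such experiments the norms concentrate near $2\sqrt{n}$, so the $3\sqrt{n}$ threshold leaves a comfortable margin.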

The following two technical lemmas are important tools for us. Define the set

\[
E(\varepsilon) = \{ Z \mid d(Z, Z^\star) \le \varepsilon \} .
\]

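Lemma 5. Suppose that A1 holds: $\left\| \frac{1}{m} \sum_{i=1}^{m} (u^\top A_i u) A_i - 2uu^\top \right\| \le \frac{\delta}{r}$ for any $u$ such that $\|u\| \le
\sqrt{\sigma_1}$. If $\delta \le \frac{1}{16} \sigma_r$, then for any $Z \in E\left( \sqrt{\tfrac{3}{16} \sigma_r} \right)$ it holds that

\[
2 \|HH^\top\|_F^2 - \delta \|H\|_F^2 \le \frac{1}{m} \sum_{i=1}^{m} \operatorname{tr}(H^\top A_i H)^2 \le \delta \|H\|_F^2 + 2 \|HH^\top\|_F^2 .
\]

Proof. Let $h_s$ be the $s$th column of $H$. Since $\max_{s \in [r]} \|h_s\|_2 \le \|H\|_F \le \sqrt{\tfrac{3}{16} \sigma_r} \le \sqrt{\sigma_1}$, it follows
from the assumption of the lemma that

\[
\left\| \frac{1}{m} \sum_{i=1}^{m} (h_s^\top A_i h_s) A_i - 2 h_s h_s^\top \right\| \le \frac{\delta}{r}, \quad s = 1, \ldots, r.
\]

By the triangle inequality, we have

\[
\left\| \frac{1}{m} \sum_{i=1}^{m} \sum_{s=1}^{r} (h_s^\top A_i h_s) A_i - 2 \sum_{s=1}^{r} h_s h_s^\top \right\| \le \delta
\]

and consequently

\[
-\delta \|h_s\|^2 \le h_s^\top \Big( \frac{1}{m} \sum_{i=1}^{m} \operatorname{tr}(H^\top A_i H) A_i - 2 HH^\top \Big) h_s \le \delta \|h_s\|^2, \quad s = 1, \ldots, r,
\]

where we replace $\sum_{s=1}^{r} h_s^\top A_i h_s$ by $\operatorname{tr}(H^\top A_i H)$ and $\sum_{s=1}^{r} h_s h_s^\top$ by $HH^\top$. Taking the sum of the
above inequalities, we obtain

\[
-\delta \|H\|_F^2 \le \frac{1}{m} \sum_{i=1}^{m} \operatorname{tr}(H^\top A_i H)^2 - 2 \operatorname{tr}(H^\top H H^\top H) \le \delta \|H\|_F^2 .
\]

Note that $\operatorname{tr}(H^\top H H^\top H) = \|HH^\top\|_F^2$. Therefore,

\[
2 \|HH^\top\|_F^2 - \delta \|H\|_F^2 \le \frac{1}{m} \sum_{i=1}^{m} \operatorname{tr}(H^\top A_i H)^2 \le \delta \|H\|_F^2 + 2 \|HH^\top\|_F^2 . \qquad \square
\]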


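Lemma 6. Suppose that A2 holds: for any $\tilde{Z}$ such that $\tilde{Z} \tilde{Z}^\top = X^\star$ we have

\[
\left\| \frac{\partial^2 f(\tilde{Z})}{\partial \tilde{z}_s \partial \tilde{z}_k^\top} - \mathbb{E}\left[ \frac{\partial^2 f(\tilde{Z})}{\partial \tilde{z}_s \partial \tilde{z}_k^\top} \right] \right\| \le \frac{\delta}{r}, \quad s, k = 1, \ldots, r. \tag{7}
\]

Then

\[
\left( \sigma_r - \frac{\delta}{2} \right) \|H\|_F^2 + \|H^\top \tilde{Z}\|_F^2 \le \frac{1}{m} \sum_{i=1}^{m} \operatorname{tr}(H^\top A_i \tilde{Z})^2 \le \left( \sigma_1 + \frac{\delta}{2} \right) \|H\|_F^2 + \|H^\top \tilde{Z}\|_F^2 .
\]

Proof. Our goal is to bound $\frac{1}{m} \sum_{i=1}^{m} \operatorname{tr}(H^\top A_i \tilde{Z})^2$. This can be expanded as

\[
\frac{1}{m} \sum_{i=1}^{m} \Big( \sum_{s=1}^{r} h_s^\top A_i \tilde{z}_s \Big)^2
= \frac{1}{m} \sum_{i=1}^{m} \sum_{s=1}^{r} (h_s^\top A_i \tilde{z}_s)^2 + \frac{1}{m} \sum_{i=1}^{m} \sum_{s \neq k} (h_s^\top A_i \tilde{z}_s)(h_k^\top A_i \tilde{z}_k) .
\]

We first bound the sum of the quadratic terms. For any $s \in [r]$, we have

\[
\frac{\partial^2 f(\tilde{Z})}{\partial \tilde{z}_s \partial \tilde{z}_s^\top} = \frac{1}{m} \sum_{i=1}^{m} 2 A_i \tilde{z}_s \tilde{z}_s^\top A_i, \qquad
\mathbb{E}\left[ \frac{\partial^2 f(\tilde{Z})}{\partial \tilde{z}_s \partial \tilde{z}_s^\top} \right] = 2 \|\tilde{z}_s\|^2 I + 2 \tilde{z}_s \tilde{z}_s^\top .
\]

It follows from assumption (7) that for any $s \in [r]$,

\[
-\frac{\delta}{r} \|h_s\|^2 \le \frac{1}{m} \sum_{i=1}^{m} 2 (h_s^\top A_i \tilde{z}_s)^2 - 2 \|\tilde{z}_s\|^2 \|h_s\|^2 - 2 (h_s^\top \tilde{z}_s)^2 \le \frac{\delta}{r} \|h_s\|^2 .
\]

Taking the sum of above inequalities, we obtain

\[
-\frac{\delta}{r} \sum_{s=1}^{r} \|h_s\|^2 \le \frac{1}{m} \sum_{i=1}^{m} \sum_{s=1}^{r} 2 (h_s^\top A_i \tilde{z}_s)^2 - 2 \sum_{s=1}^{r} \|\tilde{z}_s\|^2 \|h_s\|^2 - 2 \sum_{s=1}^{r} (h_s^\top \tilde{z}_s)^2 \le \frac{\delta}{r} \sum_{s=1}^{r} \|h_s\|^2 . \tag{8}
\]

Similarly, we bound the sum of the cross terms. For any fixed $s, k$ such that $s \neq k$, we have

\[
\frac{\partial^2 f(\tilde{Z})}{\partial \tilde{z}_s \partial \tilde{z}_k^\top} = \frac{1}{m} \sum_{i=1}^{m} 2 A_i \tilde{z}_s \tilde{z}_k^\top A_i, \qquad
\mathbb{E}\left[ \frac{\partial^2 f(\tilde{Z})}{\partial \tilde{z}_s \partial \tilde{z}_k^\top} \right] = 2 \tilde{z}_s^\top \tilde{z}_k I + 2 \tilde{z}_k \tilde{z}_s^\top ,
\]

and consequently

\[
-\frac{\delta}{r} \sum_{s \neq k} \|h_s\| \|h_k\| \le \frac{1}{m} \sum_{i=1}^{m} \sum_{s \neq k} 2 (h_s^\top A_i \tilde{z}_s)(h_k^\top A_i \tilde{z}_k) - 2 \sum_{s \neq k} \tilde{z}_s^\top \tilde{z}_k h_s^\top h_k - 2 \sum_{s \neq k} h_s^\top \tilde{z}_k \tilde{z}_s^\top h_k \le \frac{\delta}{r} \sum_{s \neq k} \|h_s\| \|h_k\| . \tag{9}
\]

We combine equations (9) and (8) to get

\[
-\frac{\delta}{r} \sum_{s, k} \|h_s\| \|h_k\| \le \frac{2}{m} \sum_{i=1}^{m} \operatorname{tr}(H^\top A_i \tilde{Z})^2 - 2 \sum_{s, k} \tilde{z}_s^\top \tilde{z}_k h_s^\top h_k - 2 \sum_{s, k} h_s^\top \tilde{z}_k \tilde{z}_s^\top h_k \le \frac{\delta}{r} \sum_{s, k} \|h_s\| \|h_k\| . \tag{10}
\]

Note that $\sum_{s, k} h_s^\top \tilde{z}_k \tilde{z}_s^\top h_k = \operatorname{tr}(H^\top \tilde{Z} H^\top \tilde{Z})$, $\sum_{s, k} \tilde{z}_s^\top \tilde{z}_k h_s^\top h_k = \|\tilde{Z} H^\top\|_F^2$ and

\[
\sum_{s, k} \|h_s\| \|h_k\| = \Big( \sum_{s=1}^{r} \|h_s\| \Big)^2 \le r \sum_{s=1}^{r} \|h_s\|^2 = r \|H\|_F^2 .
\]

By Lemma 7, $\operatorname{tr}(H^\top \tilde{Z} H^\top \tilde{Z}) = \|H^\top \tilde{Z}\|_F^2$. Replacing those terms in equation (10) gives us

\[
-\frac{\delta}{2} \|H\|_F^2 + \|\tilde{Z} H^\top\|_F^2 + \|H^\top \tilde{Z}\|_F^2 \le \frac{1}{m} \sum_{i=1}^{m} \operatorname{tr}(H^\top A_i \tilde{Z})^2 \le \frac{\delta}{2} \|H\|_F^2 + \|\tilde{Z} H^\top\|_F^2 + \|H^\top \tilde{Z}\|_F^2 .
\]

Finally, we obtain the claim by noticing that

\[
\sigma_r \|H\|_F^2 \le \|\tilde{Z} H^\top\|_F^2 \le \sigma_1 \|H\|_F^2 ,
\]

where $\sqrt{\sigma_1} = \sigma_{\max}(\tilde{Z}) \ge \cdots \ge \sigma_{\min}(\tilde{Z}) = \sqrt{\sigma_r}$ are the singular values of $\tilde{Z}$. □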

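Lemma 7. $\operatorname{tr}(H^\top \tilde{Z} H^\top \tilde{Z}) = \|H^\top \tilde{Z}\|_F^2$.

Proof. Let $\tilde{U} = \arg\min_{UU^\top = U^\top U = I} \|Z - Z^\star U\|_F^2 = \arg\max_{UU^\top = U^\top U = I} \langle U, Z^{\star\top} Z \rangle$. Note that
$\langle A, B \rangle \le \|A\|_* \|B\|$ for any matrices $A, B$ that are of the same size. The equality holds when
$B = U_A V_A^\top$ where $A = U_A \Sigma_A V_A^\top$ is the SVD of $A$. Hence, $\tilde{U} = UV^\top$ where $USV^\top$ is the SVD of
$Z^{\star\top} Z$; $\tilde{Z} = Z^\star \tilde{U}$. Therefore, $Z^\top \tilde{Z} = Z^\top Z^\star \tilde{U} = VSV^\top$ is symmetric and positive semidefinite.
Thus, $H^\top \tilde{Z} = Z^\top \tilde{Z} - \tilde{Z}^\top \tilde{Z}$ is also symmetric. This implies that $\operatorname{tr}(H^\top \tilde{Z} H^\top \tilde{Z}) = \|H^\top \tilde{Z}\|_F^2$. □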


C Linear Convergence 

Proof of Theorem 3 
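Let $H^k = Z^k - \bar Z^k$. Then we have that

$$\begin{aligned}
\|Z^{k+1} - \bar Z^k\|_F^2 &= \Big\|Z^k - \frac{\mu}{\|Z^*\|_F^2}\nabla f(Z^k) - \bar Z^k\Big\|_F^2 \\
&= \|H^k\|_F^2 + \frac{\mu^2}{\|Z^*\|_F^4}\|\nabla f(Z^k)\|_F^2 - \frac{2\mu}{\|Z^*\|_F^2}\langle\nabla f(Z^k), H^k\rangle \\
&\le \|H^k\|_F^2 + \frac{\mu^2}{\|Z^*\|_F^4}\|\nabla f(Z^k)\|_F^2 - \frac{2\mu}{\|Z^*\|_F^2}\Big(\frac{1}{\alpha}\sigma_r\|H^k\|_F^2 + \frac{1}{\beta\|Z^*\|_F^2}\|\nabla f(Z^k)\|_F^2\Big) \\
&= \Big(1 - \frac{2\mu}{\alpha}\cdot\frac{\sigma_r}{\sum_{s=1}^r\sigma_s}\Big)\|H^k\|_F^2 + \frac{\mu(\mu - 2/\beta)}{\|Z^*\|_F^4}\|\nabla f(Z^k)\|_F^2 \\
&\le \Big(1 - \frac{2\mu}{\alpha\kappa r}\Big)\|H^k\|_F^2 \\
&= \Big(1 - \frac{2\mu}{\alpha\kappa r}\Big)\,d(Z^k, Z^*)^2,
\end{aligned}$$

where we use the definition of $RC(\epsilon, \alpha, \beta)$ in the third line, $\|Z^*\|_F^2 = \|X^*\|_* = \sum_{s=1}^r\sigma_s$ in the third to last line and $0 < \mu < \min\{\alpha/2, 2/\beta\}$ in the second to last line. Therefore,

$$d(Z^{k+1}, Z^*) = \min_{\tilde Z \in S}\|Z^{k+1} - \tilde Z\|_F \le \sqrt{1 - \frac{2\mu}{\alpha\kappa r}}\;d(Z^k, Z^*).$$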
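To get a feel for the guarantee, the contraction factor $\sqrt{1 - 2\mu/(\alpha\kappa r)}$ can be turned into an iteration count. The following is a minimal sketch; the function names and the sample values of $\kappa$, $r$, $n$ are hypothetical, while $\alpha = 24$ and $\beta = 513\kappa n$ are the constants obtained in Appendix D.

```python
import math

def contraction_factor(mu, alpha, kappa, r):
    """Per-iteration factor sqrt(1 - 2*mu/(alpha*kappa*r)) from the bound above."""
    return math.sqrt(1.0 - 2.0 * mu / (alpha * kappa * r))

def iterations_needed(d0, tol, rho):
    """Smallest k with d0 * rho**k <= tol, for 0 < rho < 1 and tol < d0."""
    return max(0, math.ceil(math.log(tol / d0) / math.log(rho)))

# Illustrative values (assumptions, not taken from the paper's experiments).
kappa, r, n = 2.0, 2, 100
alpha, beta = 24.0, 513.0 * kappa * n
mu = 0.9 * min(alpha / 2, 2 / beta)      # any mu in (0, min(alpha/2, 2/beta))
rho = contraction_factor(mu, alpha, kappa, r)
k = iterations_needed(1.0, 1e-6, rho)    # iterations to shrink the distance by 1e-6
```

With these worst-case constants `rho` is very close to 1, so the bound certifies linear convergence but is conservative about its speed.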


D Regularity Condition 

As mentioned before, Nesterov [16, Theorem 2.1.11] shows that the gradient scheme converges linearly under a condition similar to the regularity condition, which is satisfied if the function is strongly convex and has a Lipschitz continuous gradient (strongly smooth). In order to prove Theorem 4, we show that with high probability the function $f$ satisfies the local curvature condition, which is analogous to strong convexity, and the local smoothness condition, which is analogous to strong smoothness.

C1 Local Curvature Condition

There exists a constant $C_1$ such that for any $Z$ satisfying $d(Z, Z^*) \le \sqrt{\tfrac{3}{16}\sigma_r}$,

$$\langle\nabla f(Z), Z - \bar Z\rangle \ge C_1\|Z - \bar Z\|_F^2 + \|(Z - \bar Z)^\top \bar Z\|_F^2.$$

C2 Local Smoothness Condition

There exist constants $C_2, C_3$ such that for any $Z$ satisfying $d(Z, Z^*) \le \sqrt{\tfrac{3}{16}\sigma_r}$,

$$\|\nabla f(Z)\|_F^2 \le C_2\|Z - \bar Z\|_F^2 + C_3\|(Z - \bar Z)^\top \bar Z\|_F^2.$$
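D.1 Proof of the Local Curvature Condition

With $H = Z - \bar Z$, $p^2 = \frac{2}{m}\sum_{i=1}^m \mathrm{tr}(H^\top A_i \bar Z)^2$ and $q^2 = \frac{1}{m}\sum_{i=1}^m \mathrm{tr}(H^\top A_i H)^2$, we have

$$\begin{aligned}
\langle\nabla f(Z), H\rangle &= \frac{2}{m}\sum_{i=1}^m \mathrm{tr}(H^\top A_i \bar Z)^2 + \frac{1}{m}\sum_{i=1}^m \mathrm{tr}(H^\top A_i H)^2 + \frac{3}{m}\sum_{i=1}^m \mathrm{tr}(H^\top A_i \bar Z)\,\mathrm{tr}(H^\top A_i H) \\
&\ge p^2 + q^2 - 3\sqrt{\frac{1}{m}\sum_{i=1}^m \mathrm{tr}(H^\top A_i \bar Z)^2}\,\sqrt{\frac{1}{m}\sum_{i=1}^m \mathrm{tr}(H^\top A_i H)^2} \\
&= p^2 + q^2 - \frac{3}{\sqrt 2}\,pq \\
&= \Big(p - \frac{3}{2\sqrt 2}\,q\Big)^2 - \frac{1}{8}q^2 \\
&\ge \frac{p^2}{2} - \frac{9}{8}q^2 - \frac{1}{8}q^2 \\
&= \frac{1}{m}\sum_{i=1}^m \mathrm{tr}(H^\top A_i \bar Z)^2 - \frac{5}{4m}\sum_{i=1}^m \mathrm{tr}(H^\top A_i H)^2 \\
&\ge \Big(\sigma_r - \frac{\delta}{2}\Big)\|H\|_F^2 + \|H^\top \bar Z\|_F^2 - \frac{5}{4}\delta\|H\|_F^2 - \frac{5}{2}\|HH^\top\|_F^2 \\
&\ge \Big(\sigma_r - \frac{7}{4}\delta - \frac{5}{2}\|H\|_F^2\Big)\|H\|_F^2 + \|H^\top \bar Z\|_F^2,
\end{aligned}$$

where we use the Cauchy-Schwarz inequality in the 2nd line, the inequality $(a - b)^2 \ge \frac{a^2}{2} - b^2$ in the 5th line, Lemmas 5 and 6 in the 7th line, and the fact that $\|HH^\top\|_F \le \|H\|_F^2$ in the 8th line. Since $\|H\|_F \le \sqrt{\tfrac{3}{16}\sigma_r}$ and $\delta \le \tfrac{1}{16}\sigma_r$, we have

$$\langle\nabla f(Z), H\rangle \ge \frac{27}{64}\sigma_r\|H\|_F^2 + \|H^\top \bar Z\|_F^2. \qquad (11)$$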
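The constant in (11) can be reproduced with exact rational arithmetic. This sketch assumes, as a reading of the derivation above rather than a statement from the paper, that Lemmas 5 and 6 contribute the terms $-\delta/2$, $-\tfrac{5}{4}\delta$ and $-\tfrac{5}{2}\|H\|_F^2$, and it plugs in the worst-case values $\|H\|_F^2 = \tfrac{3}{16}\sigma_r$ and $\delta = \tfrac{1}{16}\sigma_r$.

```python
from fractions import Fraction as F

# Worst-case bounds from the text, in units of sigma_r.
h2 = F(3, 16)      # ||H||_F^2 <= (3/16) sigma_r
delta = F(1, 16)   # delta <= (1/16) sigma_r

# Coefficient of sigma_r * ||H||_F^2 in the lower bound:
#   1 - delta/2 - (5/4)*delta - (5/2)*||H||_F^2   (assumed form of the bound)
curvature = 1 - delta / 2 - F(5, 4) * delta - F(5, 2) * h2
print(curvature)  # 27/64
```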


D.2 Proof of the Local Smoothness Condition 

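We need to upper bound $\|\nabla f(Z)\|_F^2 = \max_{\|W\|_F = 1}|\langle\nabla f(Z), W\rangle|^2$. It suffices to show that for any $W \in \mathbb{R}^{n\times r}$ of unit Frobenius norm, $|\langle\nabla f(Z), W\rangle|^2$ is upper bounded whenever $d(Z, Z^*) \le \sqrt{\tfrac{3}{16}\sigma_r}$.

Since $(a + b + c + d)^2 \le 4(a^2 + b^2 + c^2 + d^2)$, we have

$$\begin{aligned}
|\langle\nabla f(Z), W\rangle|^2 &= \Big|\frac{1}{m}\sum_{i=1}^m\big(\mathrm{tr}(H^\top A_iH) + 2\,\mathrm{tr}(H^\top A_i\bar Z)\big)\big(\mathrm{tr}(W^\top A_iH) + \mathrm{tr}(W^\top A_i\bar Z)\big)\Big|^2 \\
&= \Big|\frac{1}{m}\sum_{i=1}^m\big[\mathrm{tr}(H^\top A_iH)\,\mathrm{tr}(W^\top A_iH) + 2\,\mathrm{tr}(H^\top A_i\bar Z)\,\mathrm{tr}(W^\top A_iH) \\
&\qquad\qquad + \mathrm{tr}(H^\top A_iH)\,\mathrm{tr}(W^\top A_i\bar Z) + 2\,\mathrm{tr}(H^\top A_i\bar Z)\,\mathrm{tr}(W^\top A_i\bar Z)\big]\Big|^2 \\
&\le 4\Big(\frac{1}{m}\sum_{i=1}^m\mathrm{tr}(H^\top A_iH)\,\mathrm{tr}(W^\top A_iH)\Big)^2 + 4\Big(\frac{2}{m}\sum_{i=1}^m\mathrm{tr}(H^\top A_i\bar Z)\,\mathrm{tr}(W^\top A_iH)\Big)^2 \\
&\quad + 4\Big(\frac{1}{m}\sum_{i=1}^m\mathrm{tr}(H^\top A_iH)\,\mathrm{tr}(W^\top A_i\bar Z)\Big)^2 + 4\Big(\frac{2}{m}\sum_{i=1}^m\mathrm{tr}(H^\top A_i\bar Z)\,\mathrm{tr}(W^\top A_i\bar Z)\Big)^2.
\end{aligned}$$

The first term on the right-hand side can be upper bounded as

$$\begin{aligned}
4\Big(\frac{1}{m}\sum_{i=1}^m\mathrm{tr}(H^\top A_iH)\,\mathrm{tr}(W^\top A_iH)\Big)^2 &\le 4\Big(\frac{1}{m}\sum_{i=1}^m\mathrm{tr}(H^\top A_iH)^2\Big)\Big(\frac{1}{m}\sum_{i=1}^m\mathrm{tr}(W^\top A_iH)^2\Big) \\
&\le 4\Big(\frac{1}{m}\sum_{i=1}^m\mathrm{tr}(H^\top A_iH)^2\Big)\Big(\frac{1}{m}\sum_{i=1}^m\|A_iH\|_F^2\Big) \\
&\le 4\big(2\|H\|_F^4 + \delta\|H\|_F^2\big)\,\frac{1}{m}\sum_{i=1}^m\|A_i\|^2\|H\|_F^2 \\
&\le 36n\,\|H\|_F^2\big(2\|H\|_F^4 + \delta\|H\|_F^2\big),
\end{aligned}$$

where we use the Cauchy-Schwarz inequality in the first and second line, Lemma 5 and $\|HH^\top\|_F \le \|H\|_F^2$ in the third line and Corollary 1 in the last line.

The other three terms are bounded similarly. For the second term, we have

$$4\Big(\frac{2}{m}\sum_{i=1}^m\mathrm{tr}(H^\top A_i\bar Z)\,\mathrm{tr}(W^\top A_iH)\Big)^2 \le 16\Big(\frac{1}{m}\sum_{i=1}^m\mathrm{tr}(H^\top A_i\bar Z)^2\Big)\Big(\frac{1}{m}\sum_{i=1}^m\mathrm{tr}(W^\top A_iH)^2\Big) \le 36n\,\|H\|_F^2\big((4\sigma_1 + 2\delta)\|H\|_F^2 + 4\|H^\top\bar Z\|_F^2\big),$$

where we use Lemma 6 and Corollary 1. The third term is bounded as

$$4\Big(\frac{1}{m}\sum_{i=1}^m\mathrm{tr}(H^\top A_iH)\,\mathrm{tr}(W^\top A_i\bar Z)\Big)^2 \le 4\Big(\frac{1}{m}\sum_{i=1}^m\mathrm{tr}(H^\top A_iH)^2\Big)\Big(\frac{1}{m}\sum_{i=1}^m\mathrm{tr}(W^\top A_i\bar Z)^2\Big) \le 36n\,\|\bar Z\|_F^2\big(2\|H\|_F^4 + \delta\|H\|_F^2\big),$$

and the fourth term is bounded as

$$4\Big(\frac{2}{m}\sum_{i=1}^m\mathrm{tr}(H^\top A_i\bar Z)\,\mathrm{tr}(W^\top A_i\bar Z)\Big)^2 \le 16\Big(\frac{1}{m}\sum_{i=1}^m\mathrm{tr}(H^\top A_i\bar Z)^2\Big)\Big(\frac{1}{m}\sum_{i=1}^m\mathrm{tr}(W^\top A_i\bar Z)^2\Big) \le 36n\,\|\bar Z\|_F^2\big((4\sigma_1 + 2\delta)\|H\|_F^2 + 4\|H^\top\bar Z\|_F^2\big).$$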

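Putting these inequalities together, we have

$$\|\nabla f(Z)\|_F^2 \le 36n\big(\|\bar Z\|_F^2 + \|H\|_F^2\big)\big(2\|H\|_F^4 + (4\sigma_1 + 3\delta)\|H\|_F^2 + 4\|H^\top\bar Z\|_F^2\big).$$

Hence,

$$\frac{\|\nabla f(Z)\|_F^2}{144n\big(\|\bar Z\|_F^2 + \|H\|_F^2\big)} \le \Big(\sigma_1 + \frac{1}{2}\|H\|_F^2 + \frac{3}{4}\delta\Big)\|H\|_F^2 + \|H^\top\bar Z\|_F^2.$$

Since $\|H\|_F \le \sqrt{\tfrac{3}{16}\sigma_r}$ and $\delta \le \tfrac{1}{16}\sigma_r$, we have

$$\frac{\|\nabla f(Z)\|_F^2}{144n\big(\|\bar Z\|_F^2 + (3/16)\sigma_r\big)} \le \Big(\sigma_1 + \frac{9}{64}\sigma_r\Big)\|H\|_F^2 + \|H^\top\bar Z\|_F^2. \qquad (12)$$

D.3 Proof of the Regularity Condition

Now we combine the curvature and the smoothness conditions. For any $\gamma \in \big(0, \frac{\sigma_1}{\sigma_r}\big)$, combining equations (11) and (12), we obtain

$$\begin{aligned}
\langle\nabla f(Z), H\rangle &\ge \frac{27}{64}\sigma_r\|H\|_F^2 + \|H^\top\bar Z\|_F^2 - \gamma\frac{\sigma_r}{\sigma_1}\Big[\Big(\sigma_1 + \frac{9}{64}\sigma_r\Big)\|H\|_F^2 + \|H^\top\bar Z\|_F^2\Big] + \gamma\frac{\sigma_r}{\sigma_1}\cdot\frac{\|\nabla f(Z)\|_F^2}{144n\big(\|\bar Z\|_F^2 + (3/16)\sigma_r\big)} \\
&\ge \Big(\frac{27}{64} - \frac{73}{64}\gamma\Big)\sigma_r\|H\|_F^2 + \gamma\frac{\sigma_r}{\sigma_1}\cdot\frac{\|\nabla f(Z)\|_F^2}{144n\big(\|\bar Z\|_F^2 + (3/16)\sigma_r\big)}.
\end{aligned}$$

If we take $\gamma = \frac{1}{3}$, then

$$\begin{aligned}
\langle\nabla f(Z), H\rangle &\ge \frac{1}{24}\sigma_r\|H\|_F^2 + \frac{\sigma_r}{\sigma_1}\cdot\frac{\|\nabla f(Z)\|_F^2}{3\cdot 144n\big(\|\bar Z\|_F^2 + (3/16)\sigma_r\big)} \\
&\ge \frac{1}{24}\sigma_r\|H\|_F^2 + \frac{\sigma_r}{\sigma_1}\cdot\frac{1}{513n\,\|Z^*\|_F^2}\|\nabla f(Z)\|_F^2,
\end{aligned}$$

where we use $\|\bar Z\|_F^2 = \|Z^*\|_F^2 = \|X^*\|_* \ge \sigma_r$. Thus we have

$$\langle\nabla f(Z), H\rangle \ge \frac{1}{\alpha}\sigma_r\|H\|_F^2 + \frac{1}{\beta\|Z^*\|_F^2}\|\nabla f(Z)\|_F^2$$

for $\alpha \ge 24$ and $\beta \ge \frac{\sigma_1}{\sigma_r}\cdot 513n$.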
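The constants $\alpha = 24$ and $\beta = 513\kappa n$ follow from simple rational arithmetic, which the sketch below reproduces. The variable names are illustrative; the inputs $27/64$, $9/64$ and the choice $\gamma = 1/3$ are read off the argument above, and the factor $19/16$ comes from $\|\bar Z\|_F^2 + \tfrac{3}{16}\sigma_r \le \tfrac{19}{16}\|Z^*\|_F^2$.

```python
from fractions import Fraction as F

c11 = F(27, 64)        # coefficient of sigma_r * ||H||_F^2 in (11)
c12 = 1 + F(9, 64)     # (sigma_r/sigma_1)*(sigma_1 + (9/64) sigma_r) <= (73/64) sigma_r
gamma = F(1, 3)        # the choice of gamma made in the text

# 1/alpha: coefficient of sigma_r * ||H||_F^2 after combining (11) and (12)
alpha_inv = c11 - c12 * gamma

# beta / (kappa * n): 144 * (19/16) / gamma, using ||Z*||_F^2 >= sigma_r
beta_over_kappa_n = 144 * F(19, 16) / gamma

print(alpha_inv, beta_over_kappa_n)  # 1/24 513
```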


E Initialization 


Proof of Theorem 5 

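By assumption, we have

$$\Big\|\frac{1}{m}\sum_{i=1}^m (z_s^{*\top}A_iz_s^*)A_i - 2z_s^*z_s^{*\top}\Big\| \le \frac{\delta}{r}, \qquad s \in [r].$$

Hence,

$$\|M - 2X^*\| = \Big\|\frac{1}{m}\sum_{i=1}^m\sum_{s=1}^r (z_s^{*\top}A_iz_s^*)A_i - 2\sum_{s=1}^r z_s^*z_s^{*\top}\Big\| \le \sum_{s=1}^r\Big\|\frac{1}{m}\sum_{i=1}^m (z_s^{*\top}A_iz_s^*)A_i - 2z_s^*z_s^{*\top}\Big\| \le \delta. \qquad (13)$$

Let $\lambda'_1 \ge \cdots \ge \lambda'_n$ be the eigenvalues of $M$. By Weyl's theorem, we have

$$|\lambda'_s - 2\sigma_s| \le \delta, \qquad s \in [n],$$

with $\sigma_s = 0$ for $s > r$. Since $\delta < \sigma_r$, it is easy to see $\lambda'_1 \ge \cdots \ge \lambda'_r > \delta$ and $|\lambda'_s| \le \delta$, $s = r+1, \ldots, n$. Hence, $\lambda_s = \lambda'_s$, $s \in [r]$, and $Z^0Z^{0\top}$ is the best rank $r$ approximation of $\frac{1}{2}M$. Therefore,

$$\begin{aligned}
\|Z^0Z^{0\top} - Z^*Z^{*\top}\|_F &\le \sqrt{2r}\,\|Z^0Z^{0\top} - Z^*Z^{*\top}\| \\
&\le \sqrt{2r}\Big(\Big\|Z^0Z^{0\top} - \frac{1}{2}M\Big\| + \Big\|\frac{1}{2}M - Z^*Z^{*\top}\Big\|\Big) \\
&\le \sqrt{2r}\,\delta,
\end{aligned}$$

where we used $\|A\|_F \le \sqrt{\mathrm{rank}(A)}\,\|A\|$ in the first line, and the fact $\|Z^0Z^{0\top} - \frac{1}{2}M\| = \frac{1}{2}|\lambda_{r+1}| \le \frac{\delta}{2}$ and inequality (13) in the last line.

Let $H = Z^0 - \bar Z^0$. We want to bound $d(Z^0, Z^*)^2 = \|H\|_F^2$. According to the discussion in Lemma 7, $H^\top\bar Z^0$ is symmetric and $Z^{0\top}\bar Z^0$ is positive semidefinite.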
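The following step closely follows [21]. It holds that

$$\begin{aligned}
\|Z^0Z^{0\top} - Z^*Z^{*\top}\|_F^2 &= \|Z^0Z^{0\top} - \bar Z^0\bar Z^{0\top}\|_F^2 \\
&= \|H\bar Z^{0\top} + \bar Z^0H^\top + HH^\top\|_F^2 \\
&= \mathrm{tr}\big(H\bar Z^{0\top}H\bar Z^{0\top} + H\bar Z^{0\top}\bar Z^0H^\top + H\bar Z^{0\top}HH^\top + \bar Z^0H^\top H\bar Z^{0\top} + \bar Z^0H^\top\bar Z^0H^\top \\
&\qquad\quad + \bar Z^0H^\top HH^\top + HH^\top H\bar Z^{0\top} + HH^\top\bar Z^0H^\top + HH^\top HH^\top\big) \\
&= \mathrm{tr}\big((H^\top H)^2 + 2(H^\top\bar Z^0)^2 + 2(H^\top H)(\bar Z^{0\top}\bar Z^0) + 4(H^\top H)(H^\top\bar Z^0)\big) \\
&= \mathrm{tr}\big((H^\top H + \sqrt 2\,H^\top\bar Z^0)^2 + (4 - 2\sqrt 2)(H^\top H)(H^\top\bar Z^0) + 2(H^\top H)(\bar Z^{0\top}\bar Z^0)\big) \\
&\ge \mathrm{tr}\big((4 - 2\sqrt 2)(H^\top H)(H^\top\bar Z^0) + 2(H^\top H)(\bar Z^{0\top}\bar Z^0)\big) \\
&= \mathrm{tr}\big((4 - 2\sqrt 2)(H^\top H)(Z^{0\top}\bar Z^0)\big) + \mathrm{tr}\big((2\sqrt 2 - 2)(H^\top H)(\bar Z^{0\top}\bar Z^0)\big),
\end{aligned}$$

where in the fourth line we used the property that the trace is invariant under cyclic permutations and $H^\top\bar Z^0 = \bar Z^{0\top}H$.

Since $Z^{0\top}\bar Z^0$ is positive semidefinite, $\mathrm{tr}\big((H^\top H)(Z^{0\top}\bar Z^0)\big)$ is nonnegative. Hence,

$$\begin{aligned}
\|Z^0Z^{0\top} - Z^*Z^{*\top}\|_F^2 &\ge (2\sqrt 2 - 2)\,\mathrm{tr}\big((H^\top H)(\bar Z^{0\top}\bar Z^0)\big) \\
&= (2\sqrt 2 - 2)\,\|H\bar Z^{0\top}\|_F^2 \\
&\ge (2\sqrt 2 - 2)\,\|H\|_F^2\,\sigma_r \\
&= (2\sqrt 2 - 2)\,\sigma_r\,d(Z^0, Z^*)^2.
\end{aligned}$$

If $\delta \le \frac{\sigma_r}{4\sqrt r}$, then

$$d(Z^0, Z^*)^2 \le \frac{\|Z^0Z^{0\top} - Z^*Z^{*\top}\|_F^2}{(2\sqrt 2 - 2)\sigma_r} \le \frac{2r\delta^2}{(2\sqrt 2 - 2)\sigma_r} \le \frac{3}{16}\sigma_r.$$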
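The initialization analyzed above, followed by the gradient iteration of Appendix C, can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, problem sizes, iteration count, and the step size $0.2/\|Z^0\|_F^2$ (used as a computable surrogate for $\mu/\|Z^*\|_F^2$) are all hypothetical choices.

```python
import numpy as np

def goe(n, rng):
    """Symmetric Gaussian matrix: off-diagonal variance 1, diagonal variance 2."""
    G = rng.standard_normal((n, n))
    return (G + G.T) / np.sqrt(2.0)

def spectral_init(A, b, r):
    """Z0 built from the r eigenpairs of M = (1/m) sum_i b_i A_i largest in magnitude."""
    M = np.tensordot(b, A, axes=1) / len(b)
    lam, V = np.linalg.eigh(M)
    top = np.argsort(-np.abs(lam))[:r]
    return V[:, top] * np.sqrt(np.abs(lam[top]) / 2.0)

def grad(Z, A, b):
    """Gradient of f(Z) = (1/4m) sum_i (tr(Z^T A_i Z) - b_i)^2."""
    AZ = A @ Z                                   # shape (m, n, r)
    resid = np.einsum("nr,mnr->m", Z, AZ) - b    # tr(Z^T A_i Z) - b_i
    return np.einsum("m,mnr->nr", resid, AZ) / len(b)

rng = np.random.default_rng(0)
n, r, m = 8, 2, 800                              # hypothetical problem sizes
Q, _ = np.linalg.qr(rng.standard_normal((n, r)))
Z_star = Q * np.sqrt([2.0, 1.0])                 # X* has eigenvalues 2 and 1
X_star = Z_star @ Z_star.T
A = np.stack([goe(n, rng) for _ in range(m)])
b = np.einsum("mij,ij->m", A, X_star)            # b_i = tr(A_i X*)

Z = spectral_init(A, b, r)
err0 = np.linalg.norm(Z @ Z.T - X_star)          # error right after initialization
step = 0.2 / np.linalg.norm(Z) ** 2              # assumed surrogate for mu / ||Z*||_F^2
for _ in range(500):
    Z = Z - step * grad(Z, A, b)
err = np.linalg.norm(Z @ Z.T - X_star)           # error after gradient descent
```

The error is measured on $ZZ^\top$ rather than on $Z$ because $Z$ is only identifiable up to an orthogonal transformation.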


F Sample Complexity 

In this section, we verify that our assumptions hold with high probability if $m \ge cn\log n$, where $c$ is a constant that depends on $\delta$, $r$, and $\kappa$. Our proof relies on the following concentration inequality.

Theorem 8. (Matrix Bernstein Inequality [20]) Let $S_1, \ldots, S_m$ be independent random matrices with dimension $n \times n$. Assume that $E(S_i) = 0$ and $\|S_i\| \le L$, for all $i \in [m]$. Let $\nu^2 = \max\big\{\big\|\sum_{i=1}^m E(S_iS_i^\top)\big\|, \big\|\sum_{i=1}^m E(S_i^\top S_i)\big\|\big\}$. Then for all $\delta \ge 0$,

$$P\Big(\Big\|\frac{1}{m}\sum_{i=1}^m S_i\Big\| \ge \delta\Big) \le 2n\exp\Big(\frac{-m^2\delta^2}{\nu^2 + Lm\delta/3}\Big).$$


We first give a technical lemma that we will use later. 

Lemma 8. Let $A = (a_{ij})$ be a random matrix drawn from GOE. Let $S = a_{11}A - 2e_1e_1^\top$. There exist absolute constants $C, \rho$ such that with probability at least $1 - Ce^{-\rho n}$, we have

$$\|S\| \le 18n.$$

Proof. Let $\tilde A = A - a_{11}e_1e_1^\top$. Then $S = a_{11}\tilde A + (a_{11}^2 - 2)e_1e_1^\top$. Note that $a_{11}$ and $\tilde A$ are independent, hence $\|S\| \le |a_{11}|\,\|\tilde A\| + |a_{11}^2 - 2|$. Besides, since $a_{11} \sim N(0, 2)$, we can see that $a_{11}^2/2$ is $\chi^2$ distributed.

First we bound the operator norm of $\tilde A$. We rewrite $\|\tilde A\|$ as

$$\|\tilde A\| = \max_{\|u\|=1}|u^\top\tilde Au| = \max_{\|u\|=1}|u^\top Du - du_1^2| \le \|D\| + |d|,$$

where $D = \tilde A + de_1e_1^\top$, $d \sim N(0, 2)$. As $D$ is GOE distributed, by Lemma 4,

$$P(\|D\| \ge 3\sqrt n) \le C'e^{-\rho'n}, \qquad (14)$$

where $C'$ and $\rho'$ are absolute constants.

Using the Gaussian tail inequality, we have

$$P(|d| \ge 2\sqrt n) \le 2e^{-n}. \qquad (15)$$

Combining inequalities (14) and (15), we have

$$P(\|\tilde A\| \ge 5\sqrt n) \le P(\|D\| \ge 3\sqrt n \,\vee\, |d| \ge 2\sqrt n) \le C'e^{-\rho'n} + 2e^{-n}, \qquad (16)$$

where the last inequality follows from the union bound.

Next we bound the deviation of the $a_{11}^2$ term. By the corollary of Lemma 1 in Laurent and Massart [13], we have

$$P\big(|a_{11}^2 - 2| \ge 4(\sqrt n + n)\big) \le 2e^{-n}. \qquad (17)$$

Since $a_{11}$ is identically distributed as $d$, inequality (15) holds for $a_{11}$ as well. Namely, $P(|a_{11}| \ge 2\sqrt n) \le 2e^{-n}$. Combining this with inequalities (17), (16), we have

$$P(\|S\| \le 14n + 4\sqrt n) \ge 1 - C'e^{-\rho'n} - 6e^{-n}.$$

Finally, the statement is obtained by choosing proper $C, \rho$, and using $\sqrt n \le n$. □
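A quick Monte Carlo experiment illustrates how loose the bound $\|S\| \le 18n$ is in practice. The sketch below is only a sanity check under assumed values of `n` and `trials`; the helper names are hypothetical, and GOE matrices are sampled as $(G + G^\top)/\sqrt 2$ with $G$ standard Gaussian, which gives off-diagonal variance 1 and diagonal variance 2.

```python
import numpy as np

def goe(n, rng):
    """Sample a GOE matrix: off-diagonal variance 1, diagonal variance 2."""
    G = rng.standard_normal((n, n))
    return (G + G.T) / np.sqrt(2.0)

def lemma8_norm(A):
    """Operator norm of S = a_11 * A - 2 e_1 e_1^T."""
    S = A[0, 0] * A          # new array, A itself is left untouched
    S[0, 0] -= 2.0
    return np.linalg.norm(S, 2)

rng = np.random.default_rng(1)
n, trials = 50, 200          # illustrative sizes
worst = max(lemma8_norm(goe(n, rng)) for _ in range(trials))
ratio = worst / (18 * n)     # fraction of the bound actually attained
```

Typical values of $\|S\|$ scale like $|a_{11}|\sqrt n$, well below $18n$; the lemma trades tightness for a bound that holds with exponentially small failure probability.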
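F.1 Proof of Theorem 6

Proof. It is equivalent to show that for any unit vector $u$, with high probability,

$$\Big\|\frac{1}{m}\sum_{i=1}^m (u^\top A_iu)A_i - 2uu^\top\Big\| \le \frac{\delta}{r\sigma_1}.$$

If $P$ is an orthonormal matrix, then

$$\begin{aligned}
\Big\|\frac{1}{m}\sum_{i=1}^m \big((Pu)^\top A_i(Pu)\big)A_i - 2(Pu)(Pu)^\top\Big\| &= \Big\|\frac{1}{m}\sum_{i=1}^m \big(u^\top(P^\top A_iP)u\big)A_i - 2Puu^\top P^\top\Big\| \\
&= \Big\|\frac{1}{m}\sum_{i=1}^m \big(u^\top(P^\top A_iP)u\big)P^\top A_iP - 2uu^\top\Big\| \\
&= \Big\|\frac{1}{m}\sum_{i=1}^m (u^\top\tilde A_iu)\tilde A_i - 2uu^\top\Big\|,
\end{aligned}$$

where in the second line we use unitary invariance of the operator norm, and in the last line we denote $P^\top A_iP$ by $\tilde A_i$. Since the GOE is invariant under orthogonal conjugation, $A_i$ and $\tilde A_i$ are identically distributed. Hence, it suffices to prove the claim when $u = e_1$, i.e.

$$\Big\|\frac{1}{m}\sum_{i=1}^m a_{11}^{(i)}A_i - 2e_1e_1^\top\Big\| \le \delta_0,$$

where $a_{11}^{(i)}$ is the $(1,1)$ entry of $A_i$ and $\delta_0 = \frac{\delta}{r\sigma_1}$.

To show this, we apply Theorem 8, where $S_i = a_{11}^{(i)}A_i - 2e_1e_1^\top$. This requires that the operator norm of $S_i$ is bounded, for each $i$. We address this by noticing that with high probability $\|S_i\| \le 18n$, $\forall i$. To be precise, by Lemma 8 there exist constants $C, \rho$, such that

$$P(\|S_i\| \ge 18n) \le Ce^{-\rho n}, \qquad i = 1, \ldots, m.$$

Taking the union bound over all the $S_i$'s leads to

$$P\Big(\max_i\|S_i\| \ge 18n\Big) \le mCe^{-\rho n}. \qquad (18)$$

Next, we calculate $\nu^2 = \big\|\sum_{i=1}^m E(S_i^2)\big\|$. Let $A = (a_{ij})$ denote $A_i$, $S$ denote $S_i$. We have $E(S^2) = E(a_{11}^2A^2) - 4e_1e_1^\top$, and

$$\begin{aligned}
(a_{11}^2A^2)_{11} &= a_{11}^2\Big(a_{11}^2 + \sum_{k=2}^n a_{1k}^2\Big), \\
(a_{11}^2A^2)_{jj} &= a_{11}^2\Big(a_{jj}^2 + \sum_{k\neq j} a_{jk}^2\Big), \qquad j \neq 1, \\
(a_{11}^2A^2)_{jl} &= a_{11}^2\sum_{k=1}^n a_{jk}a_{lk}, \qquad \forall j \neq l.
\end{aligned}$$

It is easy to see that $E(a_{11}^2A^2) = \mathrm{diag}(2n + 10, 2n + 2, \ldots, 2n + 2)$. Consequently, $\nu^2 = (2n + 6)m$.

By Theorem 8, if $m \ge \frac{42}{\min(\delta_0^2, \delta_0)}\,n\log n$, then

$$\begin{aligned}
P\Big(\Big\|\frac{1}{m}\sum_{i=1}^m S_i\Big\| \ge \delta_0\Big) &\le 2n\exp\Big(\frac{-m\delta_0^2}{2n(1 + 3\delta_0) + 6}\Big) \\
&\le 2n\exp\Big(\frac{-m\delta_0^2}{2n(4 + 3\delta_0)}\Big) \\
&\le 2n\exp\Big(\frac{-m\delta_0^2}{14n\cdot\max(1, \delta_0)}\Big) \\
&\le \frac{2}{n^2}. \qquad (19)
\end{aligned}$$

Combining inequalities (18) and (19), we conclude that

$$P\Big(\Big\|\frac{1}{m}\sum_{i=1}^m a_{11}^{(i)}A_i - 2e_1e_1^\top\Big\| \le \delta_0\Big) \ge 1 - mCe^{-\rho n} - \frac{2}{n^2}.$$

□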


F.2 Proof of Theorem 7 


The formulation of the second order partial derivatives and their expectations is given in Appendix B.

It is easy to see that for any $Z \in S$, $\max_{s\in[r]}\|z_s\| \le \sqrt{\sigma_1}$. Thus it is sufficient to prove that for any two unit vectors $u$ and $y$, with high probability it holds that

$$\Big\|\frac{1}{m}\sum_{i=1}^m 2A_iuy^\top A_i - 2u^\top y\,I - 2yu^\top\Big\| \le \frac{\delta}{r\sigma_1}.$$

We can decompose $y$ as $y = \beta u + \beta_\perp u_\perp$ for a certain unit vector $u_\perp$ that is orthogonal to $u$, where $\beta^2 + \beta_\perp^2 = 1$. Let $\delta_0 = \frac{\delta}{2r\sigma_1}$. It suffices to prove the following two claims.

(i) For any unit vector $u$, with high probability

$$\Big\|\frac{1}{m}\sum_{i=1}^m 2A_iuu^\top A_i - 2I - 2uu^\top\Big\| \le \delta_0.$$

(ii) For any two orthogonal unit vectors $u$ and $u_\perp$, with high probability

$$\Big\|\frac{1}{m}\sum_{i=1}^m 2A_iuu_\perp^\top A_i - 2u_\perp u^\top\Big\| \le \delta_0.$$
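Proof of (i)

If $P$ is an orthonormal matrix, then

$$\begin{aligned}
\Big\|\frac{1}{m}\sum_{i=1}^m 2A_iPuu^\top P^\top A_i - 2I - 2Puu^\top P^\top\Big\| &= \Big\|\frac{1}{m}\sum_{i=1}^m 2P^\top A_iPuu^\top P^\top A_iP - 2I - 2uu^\top\Big\| \\
&= \Big\|\frac{1}{m}\sum_{i=1}^m 2\tilde A_iuu^\top\tilde A_i - 2I - 2uu^\top\Big\|,
\end{aligned}$$

where $\tilde A_i = P^\top A_iP$ and $A_i$ have the same distribution. Hence we only need to prove the case where $u = e_1$:

$$\Big\|\frac{1}{m}\sum_{i=1}^m 2v^{(i)}v^{(i)\top} - 2I - 2e_1e_1^\top\Big\| \le \delta_0,$$

where $v^{(i)} = A_ie_1$ is the first column of $A_i$.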

Let $S_i = 2\big(v^{(i)}v^{(i)\top} - I - e_1e_1^\top\big)$. To apply Theorem 8, we need to show that with high probability $\|S_i\|$ is bounded for each $i$ and calculate $\nu^2 = \big\|\sum_{i=1}^m E(S_i^2)\big\|$.

Let $S, v, A$ denote $S_i$, $v^{(i)}$ and $A_i$ respectively. It is easy to see that

$$\|S\| \le 2\|v\|^2 + 4 = 2(w + a_{11}^2) + 4,$$

where $w = \sum_{k=2}^n a_{1k}^2$. Since $a_{11} \sim N(0, 2)$ and $a_{1k} \sim N(0, 1)$ for $k \neq 1$, we can see that $a_{11}^2/2$ and $w$ are $\chi^2$ distributed with degrees of freedom $1$ and $n - 1$, respectively. Using the tail bound, we have

$$P\big(a_{11}^2/2 \ge 2(\sqrt n + n) + 1\big) \le e^{-n},$$

$$P(w \ge 5n - 1) \le e^{-n}.$$

It follows from the union bound that

$$P(\|S\| \ge 26n + 6) \le 2e^{-n},$$

and consequently

$$P\Big(\max_i\|S_i\| \ge 26n + 6\Big) \le 2me^{-n}. \qquad (20)$$

To calculate $\nu^2$, we expand $E(S^2)$ as

$$E(S^2) = 4E\big((vv^\top)^2\big) - 4(I + e_1e_1^\top)^2 = 4E\big(\|v\|^2vv^\top\big) - 4(I + 3e_1e_1^\top).$$

Some simple calculations show that

$$\begin{aligned}
\big(\|v\|^2vv^\top\big)_{11} &= v_1^4 + \sum_{k=2}^n v_k^2v_1^2, \\
\big(\|v\|^2vv^\top\big)_{jj} &= v_1^2v_j^2 + v_j^4 + \sum_{k\neq 1, j} v_k^2v_j^2, \qquad j = 2, \ldots, n, \\
\big(\|v\|^2vv^\top\big)_{jl} &= \sum_{k=1}^n v_k^2v_jv_l, \qquad j \neq l.
\end{aligned}$$

As $v_1 \sim N(0, 2)$, $v_j \sim N(0, 1)$ for $j \neq 1$,

$$\begin{aligned}
E\big(\|v\|^2vv^\top\big)_{11} &= 2n + 10, \\
E\big(\|v\|^2vv^\top\big)_{jj} &= n + 3, \qquad j = 2, \ldots, n, \\
E\big(\|v\|^2vv^\top\big)_{jl} &= 0, \qquad j \neq l.
\end{aligned}$$

Hence, $E(S^2) = \mathrm{diag}(8n + 24, 4n + 8, \ldots, 4n + 8)$ and thus $\nu^2 = m(8n + 24)$.
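The diagonal entries above follow from the Gaussian moments $E\,g^2 = \sigma^2$ and $E\,g^4 = 3\sigma^4$ together with independence of the coordinates of $v$. A small sketch of that bookkeeping is given below; the function names are illustrative and not from the paper.

```python
def second_moment_diag(n):
    """Diagonal of E(||v||^2 v v^T) for independent v_1 ~ N(0,2), v_j ~ N(0,1), j >= 2."""
    var1, var = 2.0, 1.0
    # (1,1) entry: E v_1^4 + sum_{k>=2} E v_k^2 * E v_1^2
    first = 3 * var1**2 + (n - 1) * var * var1
    # (j,j) entry, j >= 2: E v_1^2 E v_j^2 + E v_j^4 + sum_{k != 1,j} E v_k^2 E v_j^2
    other = var1 * var + 3 * var**2 + (n - 2) * var * var
    return first, other

def nu_squared_over_m(n):
    """Largest diagonal entry of E(S^2) = 4 E(||v||^2 v v^T) - 4 (I + 3 e_1 e_1^T)."""
    first, other = second_moment_diag(n)
    return max(4 * first - 16, 4 * other - 4)

print(second_moment_diag(10))   # (30.0, 13.0), i.e. (2n + 10, n + 3) for n = 10
print(nu_squared_over_m(10))    # 104.0, i.e. 8n + 24 for n = 10
```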
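If $m \ge \frac{128}{\min(\delta_0^2, \delta_0)}\,n\log n$, then by applying Theorem 8 we can see

$$\begin{aligned}
P\Big(\Big\|\frac{1}{m}\sum_{i=1}^m 2v^{(i)}v^{(i)\top} - 2I - 2e_1e_1^\top\Big\| \ge \delta_0\Big) &\le 2n\exp\Big(\frac{-m\delta_0^2}{8n + 24 + (\frac{26}{3}n + 2)\delta_0}\Big) \\
&\le 2n\exp\Big(\frac{-m\delta_0^2}{(128/3)\,n\max(1, \delta_0)}\Big) \\
&\le \frac{2}{n^2}. \qquad (21)
\end{aligned}$$

Combining inequalities (21) and (20) leads to

$$P\Big(\Big\|\frac{1}{m}\sum_{i=1}^m 2v^{(i)}v^{(i)\top} - 2I - 2e_1e_1^\top\Big\| \le \delta_0\Big) \ge 1 - 2me^{-n} - \frac{2}{n^2}.$$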
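Proof of (ii)

We only need to prove the case where $u = e_1$ and $u_\perp = e_2$ due to the same reason above. That is,

$$\Big\|\frac{1}{m}\sum_{i=1}^m 2v^{(i)}q^{(i)\top} - 2e_2e_1^\top\Big\| \le \delta_0,$$

where $v^{(i)}$ and $q^{(i)}$ are the first and second columns of $A_i$.

As before, let $S_i = 2\big(v^{(i)}q^{(i)\top} - e_2e_1^\top\big)$ and let $S, v, q, A$ denote $S_i$, $v^{(i)}$, $q^{(i)}$ and $A_i$ respectively. From the proof of (i), we can see that with probability at least $1 - 4e^{-n}$ both $\|v\|$ and $\|q\|$ are no larger than $\sqrt{13n + 1}$. Since $\|S\| \le 2\|v\|\,\|q\| + 2$, we have

$$P\Big(\max_i\|S_i\| \ge 26n + 4\Big) \le 4me^{-n}.$$

Next, we calculate $\nu^2 = m\max\{\|E(SS^\top)\|, \|E(S^\top S)\|\}$.

$$E(SS^\top) = 4E(\|q\|^2)E(vv^\top) + 4e_2e_2^\top,$$

$$E(S^\top S) = 4E(\|v\|^2)E(qq^\top) + 4e_1e_1^\top.$$

Some simple calculation shows that $E(\|v\|^2) = E(\|q\|^2) = n + 1$, $E(vv^\top) = I + e_1e_1^\top$ and $E(qq^\top) = I + e_2e_2^\top$. Hence,

$$E(SS^\top) = 4(n + 1)I + 4(n + 1)e_1e_1^\top + 4e_2e_2^\top,$$

$$E(S^\top S) = 4(n + 1)I + 4(n + 1)e_2e_2^\top + 4e_1e_1^\top,$$

and $\nu^2 = 8(n + 1)m$. If $m \ge \frac{78}{\min(\delta_0^2, \delta_0)}\,n\log n$, then by applying Theorem 8 we have

$$\begin{aligned}
P\Big(\Big\|\frac{1}{m}\sum_{i=1}^m 2v^{(i)}q^{(i)\top} - 2e_2e_1^\top\Big\| \ge \delta_0\Big) &\le 2n\exp\Big(\frac{-m\delta_0^2}{8(n + 1) + \frac{1}{3}(26n + 4)\delta_0}\Big) \\
&\le 2n\exp\Big(\frac{-m\delta_0^2}{26n\max(1, \delta_0)}\Big) \\
&\le \frac{2}{n^2}.
\end{aligned}$$

This means,

$$P\Big(\Big\|\frac{1}{m}\sum_{i=1}^m 2v^{(i)}q^{(i)\top} - 2e_2e_1^\top\Big\| \le \delta_0\Big) \ge 1 - 4me^{-n} - \frac{2}{n^2}. \qquad (22)$$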


G ADMM for Nuclear Norm Minimization 

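We reformulate the nuclear norm minimization problem as

$$\min_{X \in \mathbb{R}^{n\times n}} \frac{1}{2\lambda}\|\mathcal{A}(X) - b\|^2 + \|X\|_*, \qquad (23)$$

where $\lambda > 0$ is the regularization parameter. Letting $\lambda \to 0$ forces the minimizer to satisfy the affine constraint $\mathcal{A}(X) = b$.

We apply ADMM to the dual problem of (23):

$$\begin{aligned}
\min_{\alpha \in \mathbb{R}^m,\, V \in \mathbb{R}^{n\times n}} \quad & \frac{\lambda}{2}\|\alpha\|^2 - \alpha^\top b \\
\text{subject to} \quad & \|V\| \le 1, \\
& \mathcal{A}^*(\alpha) = V,
\end{aligned} \qquad (24)$$

where we introduce an auxiliary variable $V$ to make this problem equality constrained.

The augmented Lagrangian of problem (24) can be written as

$$L_\eta(\alpha, V, X) = \frac{\lambda}{2}\|\alpha\|^2 - \alpha^\top b + \mathbb{1}_{\|\cdot\| \le 1}(V) + \langle X, \mathcal{A}^*(\alpha) - V\rangle + \frac{\eta}{2}\|\mathcal{A}^*(\alpha) - V\|_F^2,$$

where $X$ is the multiplier, $\eta$ is the penalty parameter, and $\mathbb{1}_{\|\cdot\| \le 1}$ is the indicator function of the unit spectral norm ball, i.e. $\mathbb{1}_{\|\cdot\| \le 1}(V)$ equals $0$ if $\|V\| \le 1$ and $+\infty$ otherwise.

Let $\mathrm{vec}(\cdot)$ denote the vectorization of a matrix, whose inverse mapping is denoted by $\mathrm{mat}(\cdot)$. We can rewrite the transformations as $\mathcal{A}(X) = \mathbf{A}\,\mathrm{vec}(X)$ and $\mathcal{A}^*(\alpha) = \mathrm{mat}(\mathbf{A}^\top\alpha) = \sum_{i=1}^m \alpha_iA_i$, where $\mathbf{A}$ is an $m \times n^2$ matrix whose $i$th row is $\mathrm{vec}(A_i)^\top$.

The ADMM starts from initialization $(\alpha^0, V^0, X^0)$ and updates the three variables alternately. The updates can be computed in closed form:

$$\begin{aligned}
\alpha^{k+1} &= (\lambda I + \eta\mathbf{A}\mathbf{A}^\top)^{-1}\big(b + \mathbf{A}\,\mathrm{vec}(\eta V^k - X^k)\big), \\
V^{k+1} &= \mathrm{proj}\Big(\sum_{i=1}^m \alpha_i^{k+1}A_i + \frac{1}{\eta}X^k\Big), \\
X^{k+1} &= X^k + \eta\Big(\sum_{i=1}^m \alpha_i^{k+1}A_i - V^{k+1}\Big),
\end{aligned}$$

where $\mathrm{proj}(\cdot)$ is the projection onto the unit spectral norm ball. Let $X = U\Sigma V^\top$ be the singular value decomposition of $X$; then

$$\mathrm{proj}(X) = U\min(\Sigma, 1)V^\top.$$

In fact, the update of $V$ can be combined with the other steps without being computed explicitly. One only has to iterate the following two steps:

$$\begin{aligned}
\alpha^{k+1} &= (\lambda I + \eta\mathbf{A}\mathbf{A}^\top)^{-1}\Big(b + \mathbf{A}\,\mathrm{vec}\Big(\eta\sum_{i=1}^m \alpha_i^kA_i + X^{k-1} - 2X^k\Big)\Big), \\
X^{k+1} &= \mathrm{prox}_\eta\Big(\eta\sum_{i=1}^m \alpha_i^{k+1}A_i + X^k\Big),
\end{aligned}$$

where $\mathrm{prox}_\eta(\cdot)$ is the singular value soft-thresholding operator defined as

$$\mathrm{prox}_\eta(X) = U\max(\Sigma - \eta, 0)V^\top.$$

The sequence of multipliers $\{X^k\}$ converges to the primal solution of (23). To speed up the update of $\alpha$, the Cholesky decomposition of $\lambda I + \eta\mathbf{A}\mathbf{A}^\top$ is precomputed in our implementation.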
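The two-step iteration can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the function names, the zero initialization $\alpha^0 = 0$, $X^{-1} = X^0 = 0$, and the fixed iteration count are hypothetical choices. The final lines check the identity $Y - \eta\,\mathrm{proj}(Y/\eta) = \mathrm{prox}_\eta(Y)$ that justifies eliminating $V$.

```python
import numpy as np

def proj_spectral_ball(X):
    """Projection onto {V : ||V|| <= 1}: clip singular values at 1."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.minimum(s, 1.0)) @ Vt

def prox_nuclear(X, eta):
    """Singular value soft-thresholding: U max(Sigma - eta, 0) V^T."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - eta, 0.0)) @ Vt

def admm_nuclear(A_mats, b, lam, eta, iters):
    """Two-step ADMM iteration for the dual of the nuclear norm problem (sketch)."""
    m, n, _ = A_mats.shape
    A = A_mats.reshape(m, n * n)                 # ith row is vec(A_i)^T
    # Cholesky factor of lam*I + eta*A A^T, computed once and reused.
    L = np.linalg.cholesky(lam * np.eye(m) + eta * A @ A.T)
    alpha = np.zeros(m)
    X_prev = X = np.zeros((n, n))                # assumed zero initialization
    for _ in range(iters):
        rhs = b + A @ (eta * (A.T @ alpha) + X_prev.ravel() - 2 * X.ravel())
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, rhs))
        X_prev, X = X, prox_nuclear(eta * (A.T @ alpha).reshape(n, n) + X, eta)
    return X

# Check that eliminating V is valid: Y - eta * proj(Y / eta) equals prox_eta(Y).
rng = np.random.default_rng(0)
Y, eta = rng.standard_normal((5, 5)), 0.7
identity_holds = np.allclose(Y - eta * proj_spectral_ball(Y / eta), prox_nuclear(Y, eta))
```

For large $m$ a triangular solver would exploit the structure of the Cholesky factor; `np.linalg.solve` is used here only to keep the sketch dependency-free.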




